Jee Jee Li
b6553be1bc
[Misc] Slight improvement of the BNB ( #19418 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-10 13:51:49 +00:00
youkaichao
64a9af5afa
Simplify ep kernels installation ( #19412 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-10 20:06:08 +08:00
Li, Jiang
e4248849ec
[BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral ( #19411 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-06-10 12:02:40 +00:00
Rachel Guo
467bef18a3
[BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope ( #19134 )
...
Signed-off-by: Yunqiu Guo <guorachel@meta.com >
2025-06-10 16:48:51 +08:00
Isotr0py
5f1ac1e1d1
Revert "[v1] Add fp32 support to v1 engine through flex attn" ( #19404 )
2025-06-10 01:30:20 -07:00
Louie Tsai
9368cc90b2
Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. ( #17930 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-06-10 06:22:05 +00:00
Anna Pendleton
32b3946bb4
Add clear documentation around the impact of debugging flag ( #19369 )
...
Signed-off-by: Anna Pendleton <pendleton@google.com >
2025-06-10 06:16:09 +00:00
Reid
6b1391ca7e
[Misc] refactor neuron_multimodal and profiling ( #19397 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-10 06:12:42 +00:00
Russell Bryant
a3f66e75d1
Add security warning to bug report template ( #19365 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-06-10 06:06:36 +00:00
Lukas Geiger
319cb1e351
[Core] Batch multi modal input using pinned memory ( #19169 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-10 13:44:59 +08:00
Li Wang
1efef71645
[Bugfix] Fix modelscope token passed in ( #19389 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-10 13:39:37 +08:00
Nick Hill
646d62f636
[Core] Use tuple for kv cache group block ids ( #19175 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-10 07:01:17 +02:00
Reid
6cd4ae8acd
[Frontend] Add tqdm_leave_pbar to control progress bar visibility ( #19357 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-10 04:55:09 +00:00
Harry Mellor
c016047ed7
Fix docs/mkdocs/hooks/remove_announcement.py ( #19382 )
2025-06-09 21:36:54 -07:00
XiongfeiWei
9af6d22e4c
Use xla flag to improve the quantized model performance ( #19303 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-06-10 01:28:45 +00:00
Tianyu Guo
4589b94032
[Bugfix] Fix benchmark_moe.py ( #19016 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
2025-06-09 18:04:36 -07:00
Ye (Charlotte) Qi
cc867be19c
[V1] Reuse V0's memory_profiling util for gpu worker memory profiling ( #19312 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-10 08:40:01 +08:00
Siyuan Liu
3a7cd627a8
[Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration ( #19383 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-09 16:41:51 -07:00
Pavani Majety
8058c91108
[HOT-FIX] Add kv_sharing_target_layer_name argument to cutlass_mla backend ( #19374 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-06-09 19:00:07 -04:00
Siyuan Liu
7d44c469fe
[TPU]Fix KV cache sharing tests ( #19371 )
2025-06-09 18:38:15 -04:00
liusiqian-tal
31f58be96a
[Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var ( #18472 )
...
Signed-off-by: liusiqian <liusiqian@tal.com >
2025-06-09 21:41:21 +00:00
Kyle Sayers
ebb2f383b8
[Quantization] Bump compressed-tensors version ( #19295 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-09 14:33:15 -07:00
22quinn
c1c7dbbeeb
[Bugfix][Core] Prevent token lengths exceeding max_model_len in V0 ( #19348 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-09 23:01:29 +08:00
Varun Sundar Rabindranath
5cf2daea9a
[Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. ( #19298 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-06-09 10:50:39 -04:00
Isotr0py
b8089195b4
[v1] Add fp32 support to v1 engine through flex attn ( #19319 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-06-09 22:10:44 +08:00
Yinghai Lu
770e5dcdb8
[full_graph] Fix query_start_loc padding ( #19321 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-06-09 21:32:56 +08:00
Michael Yao
c57c9415b1
[Docs] Fix a bullet list in usage/security.md ( #19358 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-09 13:28:51 +00:00
Lu Fang
01810f9236
[CI] Introduce rules for llama auto-label ( #19323 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-09 20:05:42 +08:00
Conroy Cheers
59abbd84f9
[Fix] Allow kernel compilation for CUDA capability 8.7 ( #19328 )
...
Signed-off-by: Conroy Cheers <conroy@corncheese.org >
2025-06-09 02:57:23 -07:00
Jee Jee Li
95a6568b5c
[CI/Build] Fix LoRA test ( #19350 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-09 09:52:10 +00:00
Se7en
0eca5eacd0
[Doc] Fix description in the Automatic Prefix Caching design doc ( #19333 )
...
Signed-off-by: cr7258 <chengzw258@163.com >
2025-06-09 17:30:02 +08:00
Reid
12e5829221
[doc] improve ci doc ( #19307 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-09 07:26:12 +00:00
Richard Zou
3a4d417707
[Misc] Cleanup compilation tests ( #19343 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-09 15:05:44 +08:00
Kseniya Parkhamchuk
8335667c22
[Frontend] Remove unreachable code from llm.py ( #19288 )
...
Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com >
2025-06-09 10:22:10 +08:00
Isotr0py
e1c4380d4c
[Misc] Add documentation update reminder to PR template ( #19289 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-09 10:20:53 +08:00
Cyrus Leung
e31ae3de36
[Deprecation] Remove inputs arg fallback in Engine classes ( #18799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-09 10:19:56 +08:00
wang.yuqi
2ffb9b6e07
[Bugfix] model_max_length should consider max_model_len in tokenizer_config ( #19201 )
2025-06-08 07:17:53 -07:00
jennyyyyzhen
cda10fa3e2
[Multi Modal] Add an env var for message queue max chunk bytes ( #19242 )
...
Signed-off-by: yZhen <yZhen@fb.com >
Co-authored-by: yZhen <yZhen@fb.com >
2025-06-08 21:39:12 +08:00
Dipika Sikka
c123bc33f9
[Quantization] Add compressed-tensors NVFP4 support ( #18312 )
2025-06-08 09:05:55 -04:00
Akash kaothalkar
b9a1791e2c
[Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection ( #19082 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
2025-06-08 09:17:14 +00:00
Xu Wenqing
989dcee981
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B ( #19315 )
...
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-06-08 16:07:02 +08:00
Richard Zou
3d64d366e0
[Misc] Change tests/compile to use VLLM_V1 by default ( #19302 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-08 16:06:48 +08:00
Richard Zou
eaa2e51088
[Bugfix] Re-enable use_cudagraph in vLLM v1 ( #19299 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-06-08 08:56:12 +08:00
Chauncey
d77f7fb871
[Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer ( #19283 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-08 08:16:31 +08:00
Luka Govedič
2d8476e465
[BugFix][V1] Fix memory profiling bug ( #18974 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-07 10:34:51 -07:00
pramenku
88be823d57
[AMD] Update compatible packaging version ( #19309 )
...
Signed-off-by: pramkuma <Pramendra.Kumar@amd.com >
2025-06-07 20:55:09 +08:00
Lifans
4e4f63ad45
[Nit][Benchmark]Fix example in benchmark_serving_structured_output.py ( #19311 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-06-07 18:25:38 +08:00
Isotr0py
d2f0e7e615
[CI/Build] Improve Llama GGUF test robustness ( #19287 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-07 17:23:28 +08:00
Reid
122cdca5f6
[Misc] refactor context extension ( #19246 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-07 05:13:21 +00:00
Driss Guessous
cf02f9b283
Add FlexAttention to V1 ( #16078 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-06-06 21:58:55 -07:00
Aaruni Aggarwal
c4296b1a27
[CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py ( #19253 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-06-07 11:52:52 +08:00
QiliangCui
66c508b137
[TPU][Test] Add script to run benchmark on TPU for buildkite ( #19039 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-06 20:10:24 -07:00
ElizaWszola
84166fee97
[Kernel] Integrate CUTLASS MoE kernel with PPLX ( #18762 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-06 18:26:11 -07:00
Lu Fang
6e0cd10f72
[Easy][Test] Simplify test_function_tool_use with multiple parametrizes ( #19269 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-07 09:19:09 +08:00
Alexei-V-Ivanov-AMD
e010688f50
[Build][ROCm] Update Dockerfile.rocm ( #19296 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-06-06 19:35:16 -04:00
Chenyaaang
441b65d8c7
[Misc][Tools][Benchmark] Fix and improve auto tune script ( #19163 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-06 23:31:19 +00:00
Nick Hill
46ecc57973
[BugFix] Fix tpu_model_runner block_id concatenation ( #19228 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:28:17 -07:00
Nicolò Lucchesi
b6a3a9f76d
[Core] Fix abrupt request abort ( #18485 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:27:59 -07:00
Adolfo Victoria
ca27f0f9c1
[Bugfix][Core] Update cancellation logic in generate() to handle Generator exits ( #19225 )
...
Co-authored-by: Adolfo Victoria <adovi@meta.com >
2025-06-06 20:17:54 +00:00
Nick Hill
aad30bd306
[BugFix] Fix MultiConnector test after HMA changes ( #19291 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 20:16:24 +00:00
Nishidha
94ecee6282
Fixed ppc build when it runs on non-RHEL based linux distros ( #18422 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-06-06 11:54:26 -07:00
Yu Guo
8267f9916f
improve logits bias ( #19041 )
2025-06-06 19:59:25 +08:00
jmswen
7353492a47
[Core] Raise when non-multi-instance DP clients target a DP rank ( #19227 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-06 19:03:01 +08:00
Jee Jee Li
7661e92ef8
[Model] Optimize nemotron_h implementation ( #19249 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-06 10:05:14 +00:00
Siqi Yan
f168b85725
Unit Test for run_dp_sharded_vision_model ( #19103 )
...
Signed-off-by: Siqi Yan <siqi@meta.com >
Co-authored-by: Siqi Yan <siqi@meta.com >
2025-06-06 16:24:02 +08:00
Richard Zou
da511d54d8
Fix CompilationConfig repr ( #19091 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-06 16:23:35 +08:00
Nick Hill
65c69444b1
[Docs] Improve V1 KVConnector interface documentation ( #19172 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:22:45 +08:00
Dipika Sikka
94870359cd
[Quantization] Bump compressed-tensors version; update NVFP4A16 test model ( #19224 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-06-06 01:21:54 -07:00
Chengji Yao
0d49483ea9
[TPU] fix kv cache dtype in model runner ( #19244 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-06 16:20:16 +08:00
Jinghui Zhang
90b78ec5f9
[v1][P/D] Fix a edge case in kv cache schedule ( #19182 )
...
Co-authored-by: jinghui <jinghui@fb.com >
2025-06-05 23:32:55 -07:00
Aaron Pham
91a2ef98ea
[Chore] update CODEOWNERS ( #19247 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-06 06:09:43 +00:00
Xu Song
3da2313d78
Support allowed_token_ids in ChatCompletionRequest ( #19143 )
...
Signed-off-by: Xu Song <xusong.vip@gmail.com >
2025-06-06 05:06:48 +00:00
Chengji Yao
b61dc5f972
[TPU] update torch_xla pin ( #19231 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-06 04:27:38 +00:00
Chen Zhang
f8a1a2d108
[v1] Hybrid Memory Allocator ( #17996 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-05 20:47:09 -07:00
Benjamin Chislett
3465b87ef8
[Bugfix] Fix EAGLE vocab embedding construction for Llama 70B ( #19033 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-06-05 19:10:08 -07:00
Jerry Zhang
c8134bea15
Fix AOPerModuleConfig name changes ( #18869 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-06-05 18:51:32 -07:00
Luis Vega
cb6d572e85
[Model] NemotronH support ( #18863 )
...
Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com >
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com >
2025-06-05 21:29:28 +00:00
Michael Goin
87360308b7
[V1] Use FlashInfer by default on Blackwell GPUs ( #19118 )
2025-06-05 15:40:39 -04:00
Dipika Sikka
aa49f14832
[Quantization] Skip Fp4 Test for compressed-tensors ( #19217 )
2025-06-05 18:21:53 +00:00
Nicolò Lucchesi
9ef9173cfa
[P/D][NixlConnector] Enable FlashInfer backend ( #19090 )
2025-06-05 17:10:15 +00:00
Povilas Kanapickas
85e2b7bb13
[MISC][Bugfix] Use less CPU when message queue has been empty for some time ( #16226 )
...
Signed-off-by: Povilas Kanapickas <povilas@radix.lt >
2025-06-05 16:53:08 +00:00
Chiyue Wei
61059bee40
[Hardware][NVIDIA] FP4 MoE kernel optimization ( #19110 )
...
Signed-off-by: Chiyue Wei <chiyuew@nvidia.com >
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com >
2025-06-05 09:48:26 -07:00
Xu Wenqing
ec89524f50
Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 ( #19205 )
2025-06-05 16:38:54 +00:00
Patrick von Platen
f20f9f063b
[mistral_common] Add v11 tokenizer ( #19193 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-06-05 08:27:41 -07:00
Guillaume Calmettes
9bc8bb07cf
[Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided ( #19202 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-06-05 12:59:28 +00:00
Reid
1aeb925f34
[Frontend] improve vllm run-batch --help display ( #19187 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-05 11:16:25 +00:00
22quinn
188a4590d8
[Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly ( #19105 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-05 11:14:32 +00:00
vllmellm
18093084be
[Misc] Remove unnecessary fallback to prefill-decode attention ( #19138 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-06-05 16:08:26 +08:00
Simon Mo
da40380214
[Build] Annotate wheel and container path for release workflow ( #19162 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-04 23:24:56 -07:00
Chauncey
8fc57501d3
[Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled ( #19135 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-05 06:24:24 +00:00
Woosuk Kwon
af7fc84fd2
[BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 ( #19171 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-05 13:41:25 +08:00
Huy Do
0678b52251
Handle non-serializable objects when dumping benchmark results ( #19114 )
2025-06-04 22:40:04 -07:00
Yang Wang
25b918eee6
[Torch Nightly]add missing dependency ( #18770 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-06-04 21:56:12 -07:00
Michael Goin
a408820f2f
[Bugfix] Fix port handling in make_zmq_path ( #19117 )
2025-06-04 21:00:59 -06:00
Robert Shaw
c56ed8bb0e
[Bugfix][Nixl] Fix full prefix cache hit bug ( #18632 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-05 02:07:32 +00:00
Reid
78dcf56cb3
[doc] small fix ( #19167 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-05 09:13:50 +08:00
Nicolò Lucchesi
b2fac67130
[P/D] Heterogeneous TP ( #18833 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-06-04 23:25:34 +00:00
CYJiang
23027e2daf
[Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM ( #18817 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-04 15:37:25 -07:00
Varun Sundar Rabindranath
c3fd4d669a
[Kernel] Integrate batched/masked deepgemm kernel ( #19111 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-06-04 21:59:18 +00:00
Kebe
ef3f98b59f
[Bugfix] fix v1 cpu worker fails on macOS ( #19121 )
2025-06-04 20:17:38 +00:00
Siyuan Liu
7ee2590478
[TPU] Update dynamo dump file name in compilation test ( #19108 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-04 16:13:43 -04:00
Michael Goin
53a5a0ce30
[Perf] Tunings for SM100 FP8 CUTLASS kernel ( #18778 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-04 10:46:28 -07:00
Tyler Michael Smith
d459fae0a2
[Bugfix][EP+DP] Fix internode check ( #19112 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
2025-06-04 23:39:23 +08:00
jmswen
c8dcc15921
Allow AsyncLLMEngine.generate to target a specific DP rank ( #19102 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-04 08:26:47 -07:00
Cyrus Leung
8f4ffbd373
[Doc] Update V1 Guide for embedding models ( #19141 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-04 22:57:55 +08:00
Lain
5f2cd251d2
Sm100 blockwise fp8 swap ab ( #18564 )
2025-06-04 07:48:45 -07:00
Xu Wenqing
02658c2dfe
Add DeepSeek-R1-0528 function call chat template ( #18874 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-06-04 13:24:18 +00:00
Cyrus Leung
01dc9a76db
[CI/Build][Bugfix] Ensure compatibility with transformers 4.52 ( #18678 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-04 04:49:20 -07:00
wang.yuqi
35cf32df30
Improve the output precision of embedding models ( #19092 )
2025-06-04 11:48:57 +00:00
Isotr0py
8711bc5e68
[Misc] Add packages for benchmark as extra dependency ( #19089 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-04 04:18:48 -07:00
Seiji Eicher
2669a0d7b5
Fix ValueError: Missing value for tag key(s): model_name,engine. ( #19113 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-06-04 17:10:45 +08:00
Siyuan Liu
8e972d9c44
[TPU] Skip hanging tests ( #19115 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-04 01:43:00 -07:00
汪志鹏
3336c8cfbe
Fix #19130 ( #19132 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-04 01:42:06 -07:00
Woosuk Kwon
b124e1085b
[Bugfix] Fix FA3 full cuda graph correctness ( #19106 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-03 23:10:15 -07:00
Kaixi Hou
41aa578428
[NVIDIA] Add Cutlass MLA backend ( #17625 )
2025-06-03 21:40:26 -07:00
Calvin Chen
8d646c2e53
[Cleanup][v1]:remote guided-decoding-backend for example ( #19059 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-06-04 04:23:26 +00:00
Vadim Gimpelson
5d6d1adf15
[KERNEL] Sampler. CUDA kernel for applying repetition penalty ( #18437 )
2025-06-03 21:13:01 -07:00
Lukas Geiger
1409ef9134
[Core] Cast multimodal input in hf processor ( #18862 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-03 20:24:56 -07:00
Li, Jiang
4555143ea7
[CPU] V1 support for the CPU backend ( #16441 )
2025-06-03 18:43:01 -07:00
Russell Bryant
52dceb172d
[Docs] Add developer doc about CI failures ( #18782 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-04 01:09:13 +00:00
Jiaxin Shan
abd7df2fca
[Misc] Fix path and python alias errors in disagg_prefill exmaples ( #18919 )
2025-06-03 17:15:18 -07:00
Yan Ru Pei
b712be98c7
feat: add data parallel rank to KVEventBatch ( #18925 )
2025-06-03 17:14:20 -07:00
Chen Zhang
a8da78eac9
[Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers ( #19029 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-04 00:14:06 +00:00
Nicolò Lucchesi
5d96533e22
[Bugfix][P/D] Fix Prefix Cache Bug ( #18411 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-06-03 23:53:16 +00:00
Chauncey
4de790fcad
[Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled ( #19075 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-03 23:27:24 +00:00
Chen Zhang
b5fd9506c1
[Bugfix] get_num_blocks_to_allocate with null_block ( #19031 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 15:30:55 -07:00
Ekagra Ranjan
135cf55cd1
[V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix ( #18971 )
2025-06-03 15:26:33 -07:00
Chen Zhang
6cac54f4d1
[v1] Re-init input batch for multiple kv cache groups ( #18654 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 21:41:36 +00:00
Harry Mellor
6865fe0074
Fix interaction between Optional and Annotated in CLI typing ( #19093 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Yikun Jiang <yikun@apache.org >
2025-06-03 21:07:19 +00:00
Michael Goin
e31446b6c8
[Perf] Tune scaled_fp8_quant by increasing vectorization ( #18844 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-03 13:48:25 -07:00
Yong Hoon Shin
bdf13965ab
[V1] Support cross-layer KV sharing ( #18212 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-06-03 20:33:07 +00:00
Varun Sundar Rabindranath
fa98d77773
[Kernel] DeepEP dispatch-combine kernel integration ( #18434 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-03 12:30:02 -07:00
Reid
01eee40536
[doc] update docker version ( #19074 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-03 19:08:21 +00:00
SorenDreano
19bdaf32b1
[Doc] Readme standardization ( #18695 )
...
Co-authored-by: Soren Dreano <soren@numind.ai >
2025-06-03 11:50:55 -07:00
Simon Mo
02f0c7b220
[Misc] Add SPDX-FileCopyrightText ( #19100 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-06-03 11:20:17 -07:00
CYJiang
d054da1992
[Misc] fix: add miss best_of param validation ( #18555 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-03 11:02:07 -07:00
Nicolò Lucchesi
4b7817c119
[Misc] Add missing _Backend enums ( #19081 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-06-03 16:15:16 +00:00
Lu Fang
d00dd65cd4
[Doc] Improve the Pull Request template with key components ( #19086 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-03 23:44:34 +08:00
Raushan Turganbay
d81edded69
[Bugfix] disable processor cache ( #19068 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-06-03 15:06:04 +00:00
Harry Mellor
476844d44c
Fix underscores in dict keys passed via CLI ( #19030 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-06-03 14:39:24 +00:00
Jee Jee Li
4e68ae5e59
[CI/Build] Remove V0 LoRA test ( #19066 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-03 14:30:18 +00:00
youkaichao
4e88723f32
[doc] clarify windows support ( #19088 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-03 21:42:17 +08:00
Cyrus Leung
118ff92111
[Doc] Update V1 user guide for embedding and enc-dec models ( #19060 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-03 02:29:41 -07:00
Isotr0py
ec2dcd80bc
[Misc] Update WeightsMapper for qwen2-vl/qwen2.5-vl ( #19054 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-03 09:08:20 +00:00
Jee Jee Li
42243fbda0
[Doc] Add InternVL LoRA support ( #19055 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-03 09:08:03 +00:00
Michael Goin
6d18ed2a2e
Update docker docs with ARM CUDA cross-compile ( #19037 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-06-03 08:21:53 +00:00
Chen Zhang
f32fcd9444
[v1][KVCacheManager] Rename BlockHashType to BlockHash ( #19015 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 08:01:48 +00:00
Lu Fang
d32aa2e670
[Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure ( #19019 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-03 00:16:17 -07:00
Michael Goin
cc977286e7
Reduce logs in CLI scripts and plugin loader ( #18970 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-03 06:00:45 +00:00
Reid
17430e3653
[bugfix] small fix logic issue ( #18999 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-03 05:35:12 +00:00
汪志鹏
1282bd812e
Add tarsier model support ( #18985 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-03 13:13:13 +08:00
Rui Qiao
bdce64f236
[V1] Support DP with Ray ( #18779 )
2025-06-02 21:15:13 -07:00
Gregory Shtrasberg
9e6f61e8c3
[ROCm][Build] Clean up the ROCm build ( #19040 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-02 20:47:47 -07:00
Li, Jiang
8655f47f37
[CPU][CI] Re-enable the CPU CI tests ( #19046 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-06-02 20:46:47 -07:00
Concurrensee
4ce42f9204
Adding "LoRA Test %N" to AMD production tests ( #18929 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-06-02 20:46:44 -07:00
Tyler Michael Smith
8a57872b2a
[Bugfix][EP+DP] Use pplx-kernel internode instead of intranode ( #19034 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-03 11:36:51 +08:00
Hyogeun Oh (오효근)
5bc1ad6cee
[Doc] Remove duplicate TOCs during MkDocs migration ( #19021 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-06-02 19:49:48 -07:00
Siyuan Liu
9112b443a0
[Hardware][TPU] Initial support of model parallelism with single worker using SPMD ( #18011 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Co-authored-by: Hossein Sarshar <hossein.sarshar@gmail.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-06-03 00:06:20 +00:00
Calvin Chen
c57d577e8d
add an absolute path for run.sh ( #18258 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-06-02 19:38:23 +00:00
Gregory Shtrasberg
ca2f6b9c30
[Bugfix][Model] Attempt to fix eagle in V0. ( #18978 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-02 08:15:53 -07:00
Frαnçois
20133cfee2
[Frontend] enable custom logging for the uvicorn server (OpenAI API server) ( #18403 )
...
Signed-off-by: François Paupier <francois.paupier@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-02 15:04:23 +00:00
jennyyyyzhen
ebb1ec9318
[Model] enable data parallel for Llama4 vision encoder ( #18368 )
...
Signed-off-by: yzhen <yzhen@devgpu093.cco2.facebook.com >
Co-authored-by: yZhen <yZhen@fb.com >
Co-authored-by: yzhen <yzhen@devgpu093.cco2.facebook.com >
2025-06-02 19:22:54 +08:00
Reid
5b168b6d7a
[doc] add pytest tips ( #19010 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-02 11:07:26 +00:00
22quinn
9760fd8f6a
[Core] Support inplace model weights loading ( #18745 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-02 17:38:50 +08:00
Robert Shaw
b9f61e1387
[Bugfix][Nixl] Fix DP Metadata Handshake ( #19008 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-06-02 03:30:41 +00:00
zhrrr
d6fd3a33b8
[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context ( #18935 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-06-01 19:41:18 +00:00
Reid
432ec9926e
[doc] wrong output ( #19000 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-01 11:26:14 +00:00
Nick Hill
2b102d51ad
[BugFix] Fix incorrect metrics shutdown error log message ( #18992 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-01 11:42:23 +08:00
rongfu.leng
aa54a7bf7b
[BugFix] fix data parallel construct ipv6 url addres ( #18991 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-06-01 11:42:10 +08:00
Michael Goin
2ad6194a02
Let max_num_batched_tokens use human_readable_int for large numbers ( #18968 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-01 11:41:29 +08:00
Reid
c594cbf565
[doc] small fix - mkdocs ( #18996 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 20:23:43 -07:00
Isotr0py
a35ca765a5
[LoRA] Support dynamically initialize packed_modules_mapping for VLM with arbitrary components ( #18987 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-01 11:06:57 +08:00
Cyrus Leung
6aa8f9a4e7
[Core] Rework dtype resolution ( #18751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-01 11:04:23 +08:00
Benjamin Chislett
1bc86a3da1
[Bugfix] Fix EAGLE3 broken logits ( #18909 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-05-31 19:58:07 -07:00
Ekagra Ranjan
bbfa0c61d1
[Misc][Benchmark] Add support for CustomDataset ( #18511 )
2025-05-31 19:07:38 +00:00
Reid
20079c6e36
[Misc] add return token strs for tokenize ( #18941 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 18:00:11 +00:00
Nick Hill
9a1b9b99d7
[BugFix] Fix multi-node offline data-parallel ( #18981 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-05-31 08:34:52 -07:00
ptarasiewiczNV
8bf507d766
[P/D] NixlConnector use cache device index for memory registration ( #18969 )
...
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com >
2025-05-31 11:19:18 -04:00
Charlie Fu
306d60401d
[ROCm][Kernel] Add gfx950 support for skinny gemms ( #18010 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-05-31 07:40:05 -07:00
Fred Reiss
f2c3f66d59
[Bugfix] Fix for issue 17396 ( #18773 )
...
Signed-off-by: Fred Reiss <frreiss@us.ibm.com >
2025-05-31 11:58:17 +00:00
vllmellm
0f5e0d567e
[FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 ( #18825 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-31 03:39:31 -07:00
Luka Govedič
c55d804672
[BugFix] Pydantic part 2 ( #18911 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-05-31 03:39:28 -07:00
Reid
749f5bdd38
[doc] fix the list rendering issue - security.md ( #18982 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 10:39:21 +00:00
Satyajith Chilappagari
2a50ef5760
[Neuron] Add Multi-Modal model support for Neuron ( #18921 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
Co-authored-by: Rohith Nallamaddi <nalrohit@amazon.com >
Co-authored-by: FeliciaLuo <luof@amazon.com >
Co-authored-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-31 10:39:11 +00:00
Lucia Fang
b8b904795d
fix security issue of logging llm output ( #18980 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-05-31 10:38:56 +00:00
Chauncey
ba5111f237
[Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled ( #18879 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-31 09:20:54 +00:00
Yong Hoon Shin
1e123529d7
[Misc] Fix estimated max model len msg ( #18966 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-31 16:43:44 +08:00
Pooya Davoodi
dff80b0e42
[Frontend] Add rerank support to run_batch endpoint ( #16278 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-05-31 07:40:01 +00:00
Yu Guo
7782464a17
create util function for batched arange ( #18937 )
2025-05-31 13:50:38 +08:00
Lukas Geiger
0f71e24034
[Docs] Correct multiprocessing design doc ( #18964 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-31 01:30:15 +00:00
Will Eaton
1dab4d5718
Tool parser regex timeout handling ( #18960 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-30 21:02:54 +00:00
rongfu.leng
7f21e8052b
[Misc] add group_size is -1 in awq quantization ( #18910 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-30 17:34:22 +00:00
Isotr0py
5a8641638a
[VLM] Add PP support and fix GPTQ inference for Ovis models ( #18958 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-30 17:11:44 +00:00
Michael Goin
f49239cb45
Benchmark script for fp8 vs bf16 gemm ( #17126 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 10:56:11 -06:00
Nick Hill
2dbe8c0774
[Perf] API-server scaleout with many-to-many server-engine comms ( #17546 )
2025-05-30 08:17:00 -07:00
Richard Zou
84ec470fca
Improve "failed to get the hash of the compiled graph" error ( #18956 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-30 15:00:54 +00:00
Russell Bryant
b29ca5c4d5
[Docs] Update SECURITY.md with link to our security guide ( #18961 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-30 07:37:27 -07:00
Reid
ec6833c5e9
[doc] show the count for fork and watch ( #18950 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-30 06:45:59 -07:00
Shawn Huang
e1fadf1197
[Feature] minicpm eagle support ( #18943 )
...
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com >
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com >
2025-05-30 06:45:56 -07:00
Daniele
43ff405b90
[CI/Build] remove regex from build dependencies ( #18945 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-30 04:02:50 -07:00
Carol Zheng
fba02e3bd1
[Bugfix][TPU] Fix tpu model runner testcase failure ( #18810 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-30 18:04:03 +08:00
Always-Naive
4577fc9abb
[Misc]Fix typo ( #18947 )
2025-05-30 02:21:35 -07:00
Rabi Mishra
5f1d0c8118
[Bugfix][Failing Test] Fix test_vllm_port.py ( #18618 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-30 17:13:47 +08:00
Lukas Geiger
c3bb9f2331
[Model] Use in-place adds in SigLIP ( #18922 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-30 17:12:59 +08:00
Reid
8f8900cee9
[doc] add mkdocs doc ( #18930 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-30 07:58:44 +00:00
Rabi Mishra
6acb7a6285
[Misc]Fix benchmarks/README.md for speculative decoding ( #18897 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-30 07:58:04 +00:00
Cyrus Leung
4f4a6b844a
[Deprecation] Remove mean pooling default for Qwen2EmbeddingModel ( #18913 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-30 06:53:37 +00:00
Michael Goin
4d0a1541be
[Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy ( #18861 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 13:37:36 +08:00
vllmellm
77b6e74fe2
[ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. ( #18938 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-29 22:33:17 -07:00
H
5acf828d99
[docs] fix: fix markdown syntax ( #18927 )
2025-05-30 05:20:48 +00:00
iLeGend
3987e2ae96
[Model] Use AutoWeightsLoader for mamba2 ( #18918 )
...
Signed-off-by: iLeGend <824040212@qq.com >
2025-05-30 04:50:10 +00:00
Chauncey
77164dad5e
[Bugfix] Consistent ascii handling in tool parsers ( #18883 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-30 04:44:43 +00:00
Wenhua Cheng
3de3eadf5b
improve the robustness of parsing vlms config in AutoRound ( #18894 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-29 19:24:47 -07:00
Carol Zheng
3132290a14
[TPU][CI/CD] Clean up docker for TPU tests. ( #18926 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-30 10:24:19 +08:00
Cyrus Leung
1aa2f81b43
[Misc] Update type annotation for rotary embedding base ( #18914 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-30 10:17:01 +08:00
Michael Goin
d54af615d5
[Bugfix] Fix PP default fallback behavior for V1 ( #18915 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 10:13:17 +08:00
Chengji Yao
a1cc9f33a3
[TPU] remove transpose ops in moe kernel ( #18923 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-05-29 23:00:11 +00:00
Richard Zou
a521ef06e5
Use standalone_compile by default in torch >= 2.8.0 ( #18846 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-30 06:41:58 +08:00
Will Eaton
64eaf5fe05
[P/D] NixlConnector DP fixes ( #18903 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-29 18:08:40 +00:00
Nick Hill
d1d61f3351
[BugFix] Make DP work with connector-delayed new requests ( #18559 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Will Eaton <weaton@redhat.com >
2025-05-29 18:04:18 +00:00
Nicolò Lucchesi
32ce3cf7c9
[V1] Allocate kv_cache with stride order for V1 ( #18775 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-05-29 17:54:16 +00:00
CYJiang
d58f9c7f7a
[Misc] Remove duplicate init for self.vllm_config ( #18896 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-29 17:26:07 +00:00
Cyrus Leung
c29034037d
[Deprecation] Disallow pos-args other than model when initializing LLM ( #18802 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-29 09:36:58 -07:00
Gregory Shtrasberg
1b7cfd5a36
[ROCm][V0][Attention] Revert to the previous FA triton kernel ( #18226 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-29 12:13:18 -04:00
Gregory Shtrasberg
da4b69d0b4
[Attention][V1] Toggle for v1 attention backend ( #18275 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-29 10:48:24 -04:00
Isotr0py
c9479b2920
[Bugfix] Fix the failing gte embedding test ( #18720 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-29 07:39:25 -07:00
Hyogeun Oh (오효근)
6f2909405e
[Doc] Fix codeblocks formatting in LoRA adapters documentation ( #18907 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-29 07:38:55 -07:00
Duyi-Wang
b169d5f7b6
[Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. ( #18692 )
...
Signed-off-by: Duyi-Wang <duyi.wang@intel.com >
2025-05-29 20:02:08 +08:00
Chenyaaang
f8977c233f
Fix an error in dummy weight loading for quantization models ( #18855 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-29 03:07:20 -07:00
Luka Govedič
f274581f44
[BugFix] Update pydantic to fix error on python 3.10 ( #18852 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-05-29 03:05:46 -07:00
Lukas Geiger
0b1447f890
[Bugfix] Ensure tensors are contiguous during serialisation ( #18860 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-29 03:05:20 -07:00
Nicolò Lucchesi
24d0ef8970
[Misc] Replace TODO in serving transcription ( #18895 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-29 02:58:14 -07:00
Jee Jee Li
7fcfd954ff
[Bugfix] Fix misleading information in the documentation ( #18845 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-29 02:54:14 -07:00
Reid
e740d07f07
[doc] add CLI doc ( #18871 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-29 09:51:36 +00:00
Michael Yao
a652e71dd0
[Doc] Remove redundant spaces from compatibility_matrix.md ( #18891 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-29 02:51:20 -07:00
Jee Jee Li
34d6c447c4
[LoRA] Add LoRA support for InternVL ( #18842 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-29 08:46:24 +00:00
Satyajith Chilappagari
972eddf7c9
[Neuron] Add multi-LoRA support for Neuron. ( #18284 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-29 16:41:22 +08:00
Brent Salisbury
fd7bb88d72
Fixes a dead link in nightly benchmark readme ( #18856 )
...
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com >
2025-05-29 04:41:39 +00:00
Yikun Jiang
3c49dbdd03
Skip device and quant Pydantic validation to make plugin device work ( #18843 )
...
Signed-off-by: Yikun Jiang <yikunkero@gmail.com >
2025-05-28 20:12:30 -07:00
aws-elaineyz
1661a9c28f
[Doc][Neuron] Update documentation for Neuron ( #18868 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-28 19:44:01 -07:00
Chengji Yao
8e882ffdc0
[Bugfix][TPU] fix moe custom kernel import ( #18853 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-05-28 19:34:19 -07:00
Richard Zou
26b4fa45be
Add ability to use CUDAGraphs with use_inductor=False ( #17345 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-29 10:16:52 +08:00
Maximilien de Bayser
515b413ebf
Prevent the cross-encoder logic from being applied to classification tasks ( #18838 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-28 19:16:17 -07:00
Hongxia Yang
269d901734
[Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix ( #18100 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-29 07:21:46 +08:00
Varun Sundar Rabindranath
7951d78738
[Core] Enable CUDA graphs for DP + All2All kernels ( #18724 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-05-28 22:55:30 +00:00
Harry Mellor
6dbe5b5c93
Remove checks for None for fields which should never be None ( #17985 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-28 21:32:19 +00:00
Akshat Tripathi
643622ba46
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend ( #15655 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Signed-off-by: xihajun <junfan@krai.ai >
Signed-off-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk >
Signed-off-by: Jorge de Freitas <jorge@krai.ai >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: xihajun <junfan@krai.ai >
Co-authored-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk >
Co-authored-by: Jorge de Freitas <jorge@krai.ai >
2025-05-28 19:59:09 +00:00
Aaron Pham
a09c7ca9f2
[Chore][Spec Decode] Update check NoneType instead of assigning variables ( #18836 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-28 18:57:19 +00:00
Mark McLoughlin
0e98964e94
[V1][Metrics] Remove metrics that were deprecated in 0.8 ( #18837 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-28 18:54:12 +00:00
rongfu.leng
c68b5c63eb
[Misc] fix olmoe model layer can't load in tp gt 1 ( #18828 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-28 17:36:21 +00:00
Aaron Pham
fced756923
[Chore] update ty configuration ( #18839 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-28 08:59:11 -07:00
Alex Brooks
321331b8ae
[Core] Add Lora Support to Beam Search ( #18346 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-05-28 08:58:24 -07:00
daniel-salib
6e4cea1cc5
decrement server_load on listen for disconnect ( #18784 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2025-05-28 22:15:12 +08:00
Reid
435fa95444
[Frontend] add run batch to CLI ( #18804 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-28 07:08:57 -07:00
Harry Mellor
4c2b38ce9e
Enable Pydantic mypy checks and convert configs to Pydantic dataclasses ( #17599 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-28 12:46:04 +00:00
Mengqing Cao
d781930f90
[Platform][Dist] Make torch distributed process group extendable ( #18763 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-28 10:52:34 +00:00
Lucas Wilkinson
ce75efeecb
[BugFix] FA2 MLA Accuracy Issue ( #18807 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-05-28 08:59:39 +00:00
Richard Zou
aa42561e40
Fix PiecewiseCompileInterpreter ( #17338 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-28 08:40:53 +00:00
wang.yuqi
de65fc8e1e
[CI] improve embed testing ( #18747 )
2025-05-28 00:16:35 -07:00
Cyrus Leung
0c492b7824
[Deprecation] Remove fallbacks for Embeddings API ( #18795 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:09:04 +08:00
Cyrus Leung
0f0926b43f
[Deprecation] Remove unused sync methods in async_timeout ( #18792 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:08:48 +08:00
Cyrus Leung
7f2c1a87e9
[Deprecation] Require overriding get_dummy_text and get_dummy_mm_data ( #18796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:08:35 +08:00
Rabi Mishra
b78f844a67
[Bugfix][FailingTest]Fix test_model_load_with_params.py ( #18758 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-28 05:42:54 +00:00
RonaldBXu
5e13c07d00
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) ( #18781 )
...
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2025-05-28 05:09:14 +00:00
Divakar Verma
774c5fde30
[V1] fix torch profiling for V1 offline scenarios ( #18445 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-05-28 04:16:30 +00:00
Guillaume Calmettes
9a21e331ff
[Bugfix]: correctly propagate errors message caught at the chat_templating step to the client ( #18769 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-05-28 03:35:43 +00:00
wang.yuqi
3e9ce609bd
[Bugfix] Fix nomic max_model_len ( #18755 )
2025-05-27 20:29:53 -07:00
fxmarty-amd
794ae1f551
[rocm] Fix wrong attention log ( #18764 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
2025-05-27 19:45:41 -07:00
Lukas Geiger
d73a9457a5
[Core] Improve Tensor serialisation ( #18774 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-28 09:46:21 +08:00
Luka Govedič
a3896c7f02
[Build] Fixes for CMake install ( #18570 )
2025-05-27 20:49:24 -04:00
cascade
51e98e4ffd
[Bugfix] Disable prefix caching by default for benchmark ( #18771 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-28 08:18:09 +08:00
Michael Goin
e56f44d9ec
Support datasets in vllm bench serve and sync with benchmark_[serving,datasets].py ( #18566 )
2025-05-27 19:59:48 -04:00
Satyajith Chilappagari
e0cbad4e30
[Neuron] Support quantization on neuron ( #18283 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-27 22:10:33 +00:00
Carol Zheng
b48d5cca16
[CI/Build] [TPU] Fix TPU CI exit code ( #18282 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-27 14:54:59 -07:00
Michael Goin
5873877241
[Bugfix] Mistral tool calling when content is list ( #18729 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-27 09:05:37 -07:00
Cyrus Leung
696259ca01
[Core] Automatically cast multi-modal input dtype ( #18756 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 23:45:48 +08:00
chunxiaozheng
6b6d496114
optimize get_kv_cache_torch_dtype ( #18531 )
...
Signed-off-by: idellzheng <idellzheng@tencent.com >
2025-05-27 13:08:44 +00:00
cascade
aaa4ac1c95
Disable prefix cache by default for benchmark ( #18639 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-27 20:06:34 +08:00
Mark McLoughlin
06a0338015
[V1][Metrics] Add API for accessing in-memory Prometheus metrics ( #17010 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-27 09:37:06 +00:00
Cyrus Leung
4318c0559d
[CI/Build] Remove imports of built-in re ( #18750 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 09:19:18 +00:00
Hyogeun Oh (오효근)
a68e293cb9
[Doc] Convert Sphinx directives ( {class}, {meth}, {attr}, ...) to MkDocs format for better documentation linking ( #18663 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-27 01:44:20 -07:00
Shawn Huang
6881107948
[BUG FIX] minicpm ( #18739 )
...
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com >
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com >
2025-05-27 01:04:49 -07:00
Kebe
e0f0ff87b8
[Build] fix cpu build missing libtbbmalloc.so ( #18744 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-05-27 01:03:56 -07:00
maobaolong
c24b1572ac
Minor fix about MooncakeStoreConnector ( #18721 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
2025-05-27 08:02:28 +00:00
Calvin Chen
4693a3438c
[Doc] cleanup deprecated flag for doc ( #18715 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-27 07:12:02 +00:00
Łukasz Durejko
bbd9a84dc5
[Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh ( #18752 )
...
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai >
2025-05-27 00:10:26 -07:00
almersawi
a547aeb828
feat(rocm-support): support mamba2 on rocm ( #18565 )
...
Signed-off-by: Islam Almersawi <islam.almersawi@openinnovation.ai >
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai >
2025-05-27 00:07:53 -07:00
Reid
fc6d0c290f
[Misc] improve docs ( #18734 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-27 07:07:01 +00:00
Cyrus Leung
753944fa9b
[Doc] Update reproducibility doc and example ( #18741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 07:03:13 +00:00
Cyrus Leung
25a817f202
[Doc] Update OOT model docs ( #18742 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 06:30:31 +00:00
vllmellm
d260f799a9
[FEAT] [ROCm] Upgrade AITER Fused MoE kernels. ( #18271 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-26 23:14:07 -07:00
Lukas Geiger
b50602d5f0
[Model][Gemma3] Cast image pixel values already on CPU ( #18732 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-27 05:42:54 +00:00
Isotr0py
1f1b1bc03b
[V1][Quantization] Add CUDA graph compatible v1 GGUF support ( #18646 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-27 04:40:28 +00:00
Reid
1f88dbd2bb
[Misc] improve web section group title display ( #18684 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-27 04:35:16 +00:00
Lukas Geiger
0eebd74842
[Model][Gemma3] Simplify image input validation ( #18710 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-27 11:13:37 +08:00
Harry Mellor
27bebcd897
Convert examples to ruff-format ( #18400 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-26 16:57:54 +00:00
Lukas Geiger
e7523c2e03
[V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs ( #18608 )
2025-05-26 11:49:36 -04:00
Cyrus Leung
a869baca73
[Bugfix] Fix Llama GGUF initialization ( #18717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:49:22 -07:00
Cyrus Leung
82e2339b06
[Doc] Move examples and further reorganize user guide ( #18666 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:38:04 -07:00
Cyrus Leung
9553fdb41e
[Doc] Improve API docs ( #18713 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:33:34 -07:00
dylan
243eb9199f
[Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM ( #18701 )
2025-05-26 07:10:56 -07:00
Reid
0665e29998
[Misc] add AutoGen integration ( #18712 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-26 13:56:18 +00:00
Łukasz Durejko
e76be06550
[Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI ( #18709 )
...
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai >
2025-05-26 05:26:07 -07:00
Isotr0py
0877750029
[CI/Build] Split pooling and generation extended language models tests in CI ( #18705 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-26 04:00:08 -07:00
Naveassaf
6d68030f1c
[Model] Add support for YARN in NemotronNAS models ( #18427 )
...
Signed-off-by: Nave Assaf <nassaf@nvidia.com >
2025-05-26 10:31:49 +00:00
Ning Xie
5a2c76cbe1
[CI] fix dump_input for str type ( #18697 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-26 18:23:35 +08:00
Cyrus Leung
38b13dfe78
[CI/Build] Replace math.isclose with pytest.approx ( #18703 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 02:05:17 -07:00
Cyrus Leung
61a45e7a72
[Bugfix] Fix Mistral-format models with sliding window ( #18693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 01:44:04 -07:00
Cyrus Leung
65523a0995
[Doc] Fix issue template format ( #18699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 00:45:39 -07:00
Cyrus Leung
4b7740a105
[GH] Add issue template for reporting CI failures ( #18696 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 00:42:04 -07:00
Ning Xie
4ea62c0ea0
[CI] add missing argument ( #18694 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-26 00:22:04 -07:00
Maximilien de Bayser
561b77a0d6
[Bugfix] Fix the lm_head in gpt_bigcode in lora mode ( #6357 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2025-05-26 14:52:25 +08:00
CYJiang
abd4030d94
refactor: simplify request handler, use positive condition check for handler assignment ( #18690 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-26 06:32:28 +00:00
AlexZhao
8820821b59
[Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example ( #18644 )
...
Signed-off-by: zhaohaidao <zhaohaidao2008@hotmail.com >
Signed-off-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com >
Co-authored-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com >
2025-05-26 13:51:27 +08:00
Cyrus Leung
fba0642704
[CI/Build][Doc] Update gte-Qwen2-1.5B-instruct usage ( #18683 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-25 20:27:50 -07:00
Lukas Geiger
6071e989df
[Core][Multimodal] Convert PIL Image to array without data copy when hashing ( #18682 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-25 17:33:35 +00:00
Cyrus Leung
57fd13a707
[Bugfix] Fix profiling dummy data for Pixtral ( #18677 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-25 14:05:30 +00:00
Reid
3a886bd58c
[Misc] small improve ( #18680 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 06:05:38 -07:00
Reid
35be8fad62
[CI/build] fix no regex ( #18676 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 10:10:51 +00:00
Yuqi Zhang
f2faac745d
[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment ( #18674 )
...
Signed-off-by: zzzyq <zhangyuqi94@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-25 02:36:06 -07:00
Reid
279f854519
[doc] improve readability ( #18675 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 01:40:31 -07:00
Reid
624b77a2b3
[doc] fix broken links ( #18671 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 01:36:33 -07:00
Cyrus Leung
503f8487c2
[Misc] Reduce logs on startup ( #18649 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 23:03:53 -07:00
Ning Xie
44073a7ac3
[BUGFIX] catch subclass first for try...except ( #18672 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-25 05:34:24 +00:00
Michael Goin
63934543a0
Speed up the kernels/quantization/ tests ( #18669 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-25 05:02:59 +00:00
Isotr0py
75f81750f3
[VLM] Initialize video input support for InternVL models ( #18499 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-25 04:51:25 +00:00
Mengqing Cao
6ab681bcbe
[Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE ( #18655 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-25 04:51:21 +00:00
Chenguang Li
cebc22f3b6
[Misc]Replace cuda hard code with current_platform in Ray ( #14668 )
...
Signed-off-by: noemotiovon <757486878@qq.com >
2025-05-24 20:26:31 -07:00
Ning Xie
6c6dcd8611
[MISC] correct signature for LoaderFunction ( #18670 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-24 20:17:47 -07:00
Seiji Eicher
7891fdf0c6
[V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... ( #18640 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-05-24 20:07:20 -07:00
Woosuk Kwon
6825d9a998
[BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding ( #18668 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-24 17:33:46 -07:00
Reid
b554ab736e
[CI/Build] fix permission denied issue ( #18645 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-24 16:09:10 +00:00
Aaron Pham
9ea7f1abf3
fix(regression): clone from reference items ( #18662 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-24 15:25:20 +00:00
Aaron Pham
2807271c86
[CI] enforce import regex instead of re ( #18665 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-24 08:04:14 -07:00
wangxiyuan
b9018a3f9f
[BugFix] Fix import error for fused_moe ( #18642 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-05-24 07:53:36 -07:00
Ning Xie
4ceafb6299
[MISC] typo fix and clean import ( #18664 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-24 07:52:09 -07:00
Cyrus Leung
2e6705784f
[CI/Build] chmod +x to cleanup_pr_body.sh ( #18650 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 07:26:45 -07:00
Cyrus Leung
1cb194a018
[Doc] Reorganize user guide ( #18661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 07:25:33 -07:00
ztang2370
2cd4d58df4
[Model] use AutoWeightsLoader for gpt2 ( #18625 )
...
Signed-off-by: zt2370 <ztang2370@gmail.com >
2025-05-24 13:36:13 +00:00
Cyrus Leung
6d166a8d35
[Doc] Add community links ( #18657 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 06:06:38 -07:00
Cyrus Leung
ef1dd6870f
[Doc] Fix indentation problems in V0 Paged Attention docs ( #18659 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 06:06:35 -07:00
Mengqing Cao
e77dc4bad8
[MISC][pre-commit] Add pre-commit check for triton import ( #17716 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-24 20:09:15 +08:00
Cyrus Leung
07458a51ce
[Doc] Update README links, mark external links ( #18635 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 09:57:15 +00:00
qizixi
c1e4a4052d
[V1][Spec Decode] Support multi-layer eagle draft model ( #18030 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-24 09:45:34 +00:00
Yuanhao WU
a859320575
[Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) ( #18647 )
2025-05-24 09:15:36 +00:00
Reid
441dc63ac7
[Frontend] improve vllm serve --help display ( #18643 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-24 07:53:22 +00:00
qizixi
d55e446d13
[V1][Spec Decode] Small refactors to improve eagle bookkeeping performance ( #18424 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-24 06:51:22 +00:00
Wenhua Cheng
ec82c3e388
FIX MOE issue in AutoRound format ( #18586 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-23 22:01:40 -07:00
Mathieu Borderé
45ab403a1f
config.py: Clarify that only local GGUF checkpoints are supported. ( #18623 )
...
Signed-off-by: Mathieu Bordere <mathieu@letmetweakit.com >
2025-05-24 08:46:34 +08:00
Robert Shaw
2b10ba7491
[Bugfix][Nixl] Fix Preemption Bug ( #18631 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-23 23:30:16 +00:00
Feng XiaoLong
4fc1bf813a
[Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking ( #18454 )
...
Signed-off-by: Crucifixion-Fxl <xmufxl@gmail.com >
Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com >
2025-05-23 16:16:26 -07:00
Pavani Majety
f2036734fb
[ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation ( #18160 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-05-23 15:52:20 -07:00
Cyrus Leung
7d9216495c
[Doc] Update references to doc files ( #18637 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 15:49:21 -07:00
Michael Goin
0ddf88e16e
[CI] Enable test_initialization to run on V1 ( #16736 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 15:09:44 -07:00
Huy Do
1645b60196
Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI ( #18537 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-05-23 21:17:16 +00:00
Jiayi Yao
2628a69e35
[V1] Support Deepseek MTP ( #18435 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn >
Co-authored-by: Rui Qiao <ruisearch42@gmail.com >
2025-05-23 10:26:28 -07:00
Cyrus Leung
371f7e4ca2
[Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar ( #18627 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 10:22:40 -07:00
Cyrus Leung
15b45ffb9a
[Doc] Avoid documenting dynamic / internal modules ( #18626 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 09:58:02 -07:00
Cyrus Leung
273cb3b4d9
[Doc] Fix top-level API links/docs ( #18621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 09:46:56 -07:00
David Xia
8ddd1cf26a
[Doc] fix list formatting ( #18624 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-23 09:41:17 -07:00
Chen Zhang
6550114c9c
[v1] Redo "Support multiple KV cache groups in GPU model runner ( #17945 )" ( #18593 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-23 09:39:47 -07:00
Michael Goin
9520a989df
[Docs] Change mkdocs to not use directory urls ( #18622 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 09:33:21 -07:00
Harry Mellor
3d28ad343f
Fix figures in design doc ( #18612 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 09:09:54 -07:00
youkaichao
6a7988c55b
Refactor pplx init logic to make it modular (prepare for deepep) ( #18200 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-23 23:43:43 +08:00
Cyrus Leung
022d8abe29
[Doc] Use a different color for the announcement ( #18616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 08:25:03 -07:00
Hyogeun Oh (오효근)
5221815a00
[Doc] Fix markdown list indentation for MkDocs rendering ( #18620 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-23 08:23:21 -07:00
Simon Mo
1068556b2c
[Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS ( #18579 )
2025-05-23 07:43:58 -07:00
Reid
2cd1fa4556
[Misc] add Haystack integration ( #18601 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-23 06:21:19 -07:00
Harry Mellor
d4c2919760
Include private attributes in API documentation ( #18614 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 06:18:31 -07:00
Tristan Leclercq
6220f3c6b0
[Bugfix] Fix transformers model impl ignored for mixtral quant ( #18602 )
...
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com >
2025-05-23 05:54:13 -07:00
Harry Mellor
52fb23f47e
Fix examples with code blocks in docs ( #18609 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 05:53:44 -07:00
Cyrus Leung
6dd51c7ef1
[CI/Build] Fix V1 flag being set in entrypoints tests ( #18598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 05:51:53 -07:00
Harry Mellor
2edb533af2
Replace {func} with mkdocs style links ( #18610 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 05:51:38 -07:00
Hyogeun Oh (오효근)
38a95cb4a8
[Doc] Fix indent of contributing to vllm ( #18611 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-23 05:50:07 -07:00
Ning Xie
cd821ea5d2
[CI] fix kv_cache_type argument ( #18594 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-23 04:49:18 -07:00
Kay Yan
7ab056c273
[Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to requirements/cpu.txt ( #18542 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-05-23 04:38:42 -07:00
Harry Mellor
6526e05111
Add myself as docs code owner ( #18605 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 04:08:31 -07:00
Madeesh Kannan
e493e48524
[V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled ( #17731 )
...
Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-23 03:38:23 -07:00
Mengqing Cao
4ce64e2df4
[Bugfix][Model] Fix baichuan model loader for tp ( #18597 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-23 02:39:05 -07:00
Cyrus Leung
fbb13a2c15
Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal ( #18034 )" ( #18600 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 02:18:22 -07:00
Harry Mellor
a1fe24d961
Migrate docs from Sphinx to MkDocs ( #18145 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 02:09:53 -07:00
Yuqi Zhang
d0bc2f810b
[Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform ( #18430 )
...
Signed-off-by: Yuqi Zhang <yuqizhang@google.com >
Co-authored-by: Yuqi Zhang <yuqizhang@google.com >
2025-05-23 01:41:37 -07:00
Chauncey
b046cf792d
[Feature][V1]: supports cached_tokens in response usage ( #18149 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-05-23 01:41:03 -07:00
Michael Goin
54af915949
[Doc] Update quickstart and install for cu128 using --torch-backend=auto ( #18505 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 08:36:37 +00:00
cascade
71ea614d4a
[Feature]Add async tensor parallelism using compilation pass ( #17882 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-23 01:03:34 -07:00
RonaldBXu
4c611348a7
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal ( #18034 )
...
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2025-05-23 00:37:18 -07:00
Ning Xie
60cad94b86
[Hardware] correct method signatures for HPU,ROCm,XPU ( #18551 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-22 22:31:59 -07:00
Shanshan Shen
9c1baa5bc6
[Misc] Replace cuda hard code with current_platform ( #16983 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-05-23 04:38:50 +00:00
Teruaki Ishizaki
4be2255c81
[Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key ( #17291 )
...
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com >
2025-05-23 12:30:47 +08:00
aws-elaineyz
ed5d408255
[Neuron] Remove bypass on EAGLEConfig and add a test ( #18514 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-22 21:26:32 -07:00
Benjamin Chislett
583507d130
[Spec Decode] Make EAGLE3 draft token ID mapping optional ( #18488 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-22 20:17:39 -07:00
lkchen
e44d8ce8c7
[Bugfix] Set KVTransferConfig.engine_id in post_init ( #18576 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-05-23 02:54:42 +00:00
Nick Hill
93ecb8139c
[BugFix] Increase TP execute_model timeout ( #18558 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-23 10:22:11 +08:00
CYJiang
fae453f8ce
[Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs ( #18482 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-23 10:15:32 +08:00
Harry Mellor
4b0da7b60e
Enable hybrid attention models for Transformers backend ( #18494 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 10:12:08 +08:00
Mark McLoughlin
c6b636f9fb
[V1][Spec Decoding] Use model_loader.get_model() to load models ( #18273 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-23 02:05:44 +00:00
Chenheli Hua
04eb88dc80
Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. ( #18569 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-05-23 01:59:18 +00:00
rasmith
46791e1b4b
[AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh ( #18568 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-05-22 18:45:35 -07:00
Sanger Steel
c32e249a23
[Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization ( #17926 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-05-22 18:44:18 -07:00
Kai Wu
c91fe7b1b9
[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser ( #17917 )
...
Signed-off-by: Kai Wu <kaiwu@meta.com >
2025-05-22 16:44:08 -07:00
Ekagra Ranjan
a04720bc36
[V1][Spec Decode][Bugfix] Load quantize weights for EAGLE ( #18290 )
2025-05-22 15:17:33 -07:00
lkchen
7b9d832c80
[Tool] Add NIXL installation script ( #18172 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-22 14:33:16 -07:00
Tyler Michael Smith
6e588da0f4
[Build/CI] Fix CUDA 11.8 build ( #17679 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-22 12:13:54 -07:00
Mengqing Cao
f8d2cc5f55
[Compile][Platform] Make PiecewiseBackend pluggable and extendable ( #18076 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-05-22 12:11:53 -07:00
wangxiyuan
721fb9b181
[Platform] Move platform check to right place ( #18470 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-05-22 12:11:28 -07:00
David Xia
1f3a1200e4
[Bugfix] make test_openai_schema.py pass ( #18224 )
...
Signed-off-by: David Xia <david@davidxia.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-22 18:34:06 +00:00
Lukas Geiger
54631f8262
[Misc] Call ndarray.tobytes() directly instead of ndarray.data.tobytes() ( #18347 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-22 09:00:13 -07:00
Reid
cb506ecb5a
[Misc] improve Automatic Prefix Caching example ( #18554 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-22 14:50:46 +00:00
Li, Jiang
93f71673ce
[BugFix][CPU] Fix x86 SHM distributed module initialization ( #18536 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-05-22 07:35:00 -07:00
Calvin Chen
3f505233fd
[Doc] Add stream flag for chat completion example ( #18524 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-22 14:07:10 +00:00
Bowen Wang
4e04eceb58
[Bugfix] Use random hidden states in dummy sampler run ( #18543 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
2025-05-22 06:48:56 -07:00
CYJiang
71075029f2
[Doc] Support --stream arg in openai_completion_client.py script ( #18388 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-22 13:20:17 +00:00
Harry Mellor
ca86a7cf6e
[CI/Build] Update bamba test model location ( #18544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-22 06:01:07 -07:00
lkchen
a35a494745
[Bugfix] Add kwargs to RequestOutput __init__ to be forward compatible ( #18513 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-22 05:24:43 -07:00
燃
f6037d1907
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18526 )
...
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-22 05:22:53 -07:00
aws-elaineyz
fa72f9a812
Order sequence ids + config update to support specifying custom quantization layers ( #18279 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Tailin Pan <tailinpa@amazon.com >
Co-authored-by: Rishabh Rajesh <rishyraj@amazon.com >
Co-authored-by: Yishan McNabb <yishanm@amazon.com >
Co-authored-by: Patrick Lange <patlange@amazon.com >
Co-authored-by: Maxwell Goldberg <mgld@amazon.com >
Co-authored-by: Aakash Shetty <sheaak@amazon.com >
2025-05-22 02:20:36 -07:00
aws-elaineyz
ebed81fbf5
Update default neuron config for speculation ( #18274 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com >
Co-authored-by: Aakash Shetty <sheaak@amazon.com >
2025-05-22 02:18:55 -07:00
Satyajith Chilappagari
e2d7d31244
[Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) ( #18512 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-22 02:17:34 -07:00
Cyrus Leung
23b67b37b2
[Doc] Fix invalid JSON in example args ( #18527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-22 07:11:46 +00:00
Jee Jee Li
db5a29ba19
[Bugfix] Fix LoRA test ( #18518 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-21 21:48:53 -07:00
Shane A
51797775c3
[Bugfix][Model] Make Olmo2Model weight loading return loaded weights ( #18504 )
...
Signed-off-by: Shane A <shanea@allenai.org >
2025-05-21 21:17:03 -07:00
Nick Hill
cf5984b2fe
[BugFix][DP] Send DP wave completion only from dp_rank==0 ( #18502 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: kourosh hakhamaneshi <kourosh@anyscale.com >
2025-05-21 20:25:25 -07:00
youngrok cha
d022115cc6
[Bugfix] Inconsistent token calculation compared to HF in llava family ( #18479 )
...
Signed-off-by: jaycha <jaycha@ncsoft.com >
2025-05-21 20:21:47 -07:00
Rabi Mishra
acb54ca8e1
Initialize io_thread_pool attribute in the beginning. ( #18331 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-21 20:21:14 -07:00
Russell Bryant
6e0fd34d3c
[CI] Fix race condition with StatelessProcessGroup.barrier ( #18506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-21 20:19:13 -07:00
Ning Xie
176d62e4ea
[MISC] update project urls in pyproject.toml ( #18519 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-21 20:17:34 -07:00
Dhia Eddine Rhaiem
20bd6f4d2e
[FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) ( #18500 )
...
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae >
Co-authored-by: younesbelkada <younesbelkada@gmail.com >
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae >
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae >
2025-05-21 19:23:59 -07:00
Sebastian Schoennenbeck
1f079540db
[Bugfix] Consistent ascii handling in tool parsers ( #17704 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-05-21 20:41:23 +00:00
vllmellm
94d8ec8d2b
[FEAT][ROCm] Upgrade AITER MLA v1 backend ( #18338 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-05-21 10:34:28 -07:00
Mark McLoughlin
bb0a311213
Revert "[v1] Support multiple KV cache groups in GPU model runner ( #17945 )" ( #18459 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-21 10:25:23 -07:00
Hosang
dd5fa7e04f
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 ( #17004 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-05-21 08:35:00 -07:00
Hyogeun Oh (오효근)
2b16104557
[Misc] Update deprecation message for --enable-reasoning ( #18404 )
2025-05-21 07:33:11 -07:00
Kebe
371376f996
[Build] fix Dockerfile shell ( #18402 )
2025-05-21 07:32:06 -07:00
bnellnm
c6c10ca920
[Bugfix] Reduce moe_sum test size to avoid OOM ( #18484 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-05-21 06:46:39 -07:00
GiantCroc
c154d89306
[Doc] fix arg docstring in linear layers ( #18410 )
...
Signed-off-by: giantcroc <1204449533@qq.com >
2025-05-21 06:45:57 -07:00
Dhia Eddine Rhaiem
eca18691d2
[MODEL] FalconH1 ( #18406 )
...
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae >
Co-authored-by: younesbelkada <younesbelkada@gmail.com >
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae >
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae >
2025-05-21 04:59:06 -07:00
Rabi Mishra
61acfc45bc
[Bugfix][Failing Test] Fix test_events.py ( #18460 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-21 04:57:28 -07:00
Reid
107f5fc4cb
[Misc] refactor disaggregated-prefill-v1 example ( #18474 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-21 11:10:14 +00:00
Yong Hoon Shin
907f935de9
[V1] Fix general plugins not loaded in engine for multiproc ( #18326 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-21 01:21:49 -07:00
Kebe
5d7f545204
[Frontend] deprecate --device arg ( #18399 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-05-21 01:21:17 -07:00
Nicolò Lucchesi
cd8dfc6dfc
[Misc] MultiConnector._connectors type ( #18423 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-05-20 22:48:43 -07:00
wwl2755
d06dd72ba9
[Bugfix][Failing Test] Fix nixl connector test when prompt size < block size ( #18429 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-20 22:41:44 -07:00
Cyrus Leung
ad0012a0ac
Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18407 )" ( #18456 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-20 22:39:22 -07:00
bnellnm
92247c522e
[Bug] Fix moe_sum signature ( #18440 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-05-20 22:37:08 -07:00
Gregory Shtrasberg
0c15c2e486
[Bugfix] config.head_dim is now explicitly set to None ( #18432 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-20 21:04:33 -07:00
Michael Goin
3b17ea26e4
[TPU] Re-enable the Pallas MoE kernel ( #18025 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-05-20 19:52:27 -07:00
Dilip Gowda Bhagavan
23baa2180b
fix:Build torch wheel inline rather than picking from nightly ( #18351 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
2025-05-20 22:22:24 +00:00
Percy
980a172474
[Kernel] update comment for KV shape in unified triton attn ( #18099 )
...
Signed-off-by: haochengxia <xhc_1007@163.com >
2025-05-20 11:19:34 -07:00
Calvin Chen
e1f5a71ed7
[Model] use AutoWeightsLoader for bloom ( #18300 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-20 09:40:05 -07:00
Michael Goin
f4a8a37465
[Minor] Rename quantization nvfp4 to modelopt_fp4 ( #18356 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-20 09:08:37 -07:00
Reid
8f55962a7f
[Misc] refactor prompt embedding examples ( #18405 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-20 15:26:12 +00:00
燃
be48360c1f
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18407 )
...
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com >
2025-05-20 06:59:48 -07:00
wang.yuqi
86847700d7
[CI] Add mteb testing to test the accuracy of the embedding model ( #17175 )
2025-05-20 06:51:12 -07:00
汪志鹏
d6c86d09ae
Update cpu.txt ( #18398 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-05-20 10:53:23 +00:00
Jee Jee Li
6b35cb10a0
[Misc] Add LoRA code owner ( #18387 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-20 03:27:30 -07:00
Reid
1b1e8e05ff
[doc] update env variable export ( #18391 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-20 08:53:27 +00:00
Random Fly
bca55b556f
[Bugfix] fix adding bias twice in ipex GPTQ quantization ( #18363 )
...
Signed-off-by: rand-fly <randfly@outlook.com >
2025-05-20 00:54:33 -07:00
Kevin H. Luu
d981396778
[release] Change dockerhub username for TPU release ( #18389 )
2025-05-19 23:49:23 -07:00
Nan Qin
9609327fa4
[Core] [Bugfix]: tensor parallel with prompt embeds ( #18171 )
...
Signed-off-by: Nan2018 <nan@protopia.ai >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
2025-05-19 20:21:27 -07:00
Isotr0py
f07a673eb2
[Misc] Allow AutoWeightsLoader to skip loading weights with specific substr in name ( #18358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-19 20:20:12 -07:00
Liangfu Chen
d565e0976f
[neuron] fix authorization issue ( #18364 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-05-19 23:30:32 +00:00
Lucia Fang
258bf621d5
fix CUDA_check redefinition in #17918 ( #18287 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-05-19 13:42:35 -07:00
Satyajith Chilappagari
dc1440cf9f
Neuron up mistral ( #18222 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-19 09:54:47 -07:00
Gong Shufan
8171221834
[Misc] Fix typo ( #18330 )
2025-05-19 09:51:01 -07:00
sunyicode0012
7937c2fd52
Add files via uploadAdd fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup ( #18337 )
2025-05-19 09:49:57 -07:00
Wenhua Cheng
e2ee1e8e9e
[Feature]Add support for models quantized with AutoRound ( #17850 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-19 09:38:53 -07:00
Reid
20d8ce81eb
[Frontend] add --quick option for vllm chat/complete ( #18297 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-19 09:36:13 -07:00
Elad Segal
84ab4feb7e
[Doc] Fix typo ( #18355 )
2025-05-19 16:05:16 +00:00
Jee Jee Li
6781af5608
[Quantization] Pool model support bitsandbytes ( #18087 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-19 09:03:43 -07:00
Nick Hill
1b15df2546
[BugFix] Fix handling of num_computed_tokens with connector ( #18232 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-05-19 09:03:25 -07:00
Cyrus Leung
43b5f61dce
[Doc] Move input-related docs to Features ( #18353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-19 15:08:39 +00:00
Li Wang
c5bb0ebdc6
[Doc] Fix prompt embedding examples ( #18350 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-05-19 06:48:16 -07:00
Shaoyu Yang
d637b96099
[BugFix] [Vul] Add missing usedforsecurity=False in MD5 hashing to enable FIPS ( #18319 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
Signed-off-by: shaoyuyoung <shaoyuyoung@gmail.com >
Co-authored-by: cascade <cascade812@outlook.com >
2025-05-19 01:31:23 -07:00
CYJiang
275c5daeb0
fix: Add type specifications for CLI arguments in tensorizer options ( #18314 )
2025-05-18 23:42:17 -07:00
Simon Mo
47fda6d089
[Build] Supports CUDA 12.6 and 11.8 after Blackwell Update ( #18316 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-05-18 23:19:33 -07:00
Reid
27d0952600
[Misc] extract parser.parse_args() ( #18323 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-19 04:06:26 +00:00
Nan Qin
221cfc2fea
Feature/vllm/input embedding completion api ( #17590 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Nan2018 <nan@protopia.ai >
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com >
Co-authored-by: Bryce1010 <bryceyx@gmail.com >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
Co-authored-by: Andrew Sansom <qthequartermasterman@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-18 20:18:05 -07:00
wwl2755
9da1095daf
[Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa ( #18175 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-18 19:49:46 -07:00
Robin
d1211f8794
[Doc] Add doc to explain the usage of Qwen3 thinking ( #18291 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-05-18 23:04:07 +00:00
Reid
b6a6e7a529
[Misc] add litellm integration ( #18320 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-18 15:32:30 +00:00
Lifu Huang
4fb349f66a
Fix copy-paste error in phi4mm image processing ( #18315 )
...
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com >
2025-05-18 07:00:12 -07:00
22quinn
908733aca7
[Model] Use sigmoid for single-label classification ( #18313 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-05-18 07:00:09 -07:00
Reid
1a8f68bb90
[doc] update reasoning doc ( #18306 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-18 06:59:14 -07:00
cascade
9ab2c02ff8
Support sequence parallelism combined with pipeline parallelism ( #18243 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-17 22:47:25 +00:00
Ning Xie
66e63e86ec
[MISC] fix typo ( #18305 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-17 10:52:09 -07:00
rongfu.leng
9214e60631
[Model] use AutoWeightsLoader for solar ( #18113 )
2025-05-17 00:24:17 -07:00
Nishidha
f880d42582
Fixed build on ppc64le due to openssl conflicts ( #18262 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-05-17 00:23:46 -07:00
Michael Goin
dcfe95234c
Update Dockerfile to build for Blackwell ( #18095 )
2025-05-17 00:23:25 -07:00
Siyuan Liu
48ac2bed5b
[Hardware][TPU] Optionally import for TPU backend ( #18269 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
Co-authored-by: Carol Zheng <cazheng@google.com >
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com >
Co-authored-by: Hongmin Fan <fanhongmin@google.com >
2025-05-17 15:23:12 +08:00
David Ben-David
3e0d435027
[P/D][V1] Support dynamic loading of external KV connector implementations ( #18142 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-05-17 06:40:39 +00:00
汪志鹏
4ee4826ede
[BugFix] Correct max_model_len derivation from config.json for Mistral format ( #17937 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Co-authored-by: tracelogfb <48808670+tracelogfb@users.noreply.github.com >
Co-authored-by: Stephen Chen <tracelog@meta.com >
2025-05-17 04:20:13 +00:00
Reid
60017dc841
[Misc] reformat the collect-env output ( #18285 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-16 19:46:18 -07:00
Trevor Royer
55f1a468d9
Move cli args docs to its own page ( #18228 ) ( #18264 )
...
Signed-off-by: Trevor Royer <troyer@redhat.com >
2025-05-16 19:43:45 -07:00
Michael Goin
fd195b194e
[V1][P/D] Local attention optimization for NIXL ( #18170 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-16 21:16:33 -04:00
Woosuk Kwon
fabe89bbc4
[Spec Decode] Don't fall back to V0 when spec decoding is enabled ( #18265 )
2025-05-16 16:10:27 -07:00
Jinzhen Lin
e73b7dfd69
[Bugfix] fix an illegal memory access was encountered of marlin kernel + act_order ( #18245 )
2025-05-16 16:02:44 -07:00
Bowen Wang
7fdfa01530
[Sampler] Adapt to FlashInfer 0.2.3 sampler API ( #15777 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-16 15:14:03 -07:00
Sanger Steel
aef94c6d07
[CI] Assign reviewer to mergify with changes to Tensorizer files ( #18278 )
2025-05-16 12:04:14 -07:00
Nick Hill
0ceaebf87b
[BugFix] Fix ordering of KVConnector finished send/rcv sets ( #18211 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-16 09:20:54 -07:00
Nick Hill
1db4f47f81
[BugFix] Fix multi async save in MultiConnector ( #18246 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-16 08:13:47 -07:00
Reid
d3d91b6f71
[Misc][MacOS] fix bfloat16 error ( #18249 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-16 15:05:59 +00:00
learner0810
87d871470d
[Model] Use autoweightloader for dbrx ( #18251 )
...
Signed-off-by: learner0810 <zhongjun.li@daocloud.io >
2025-05-16 07:54:13 -07:00
fxmarty-amd
a5f8c111c2
[Fix] Fix typo in resolve_hf_chat_template ( #18259 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
2025-05-16 14:52:41 +00:00
Lain
e23564cb70
use ceil_div in cutlass block scaling shape check ( #17918 )
2025-05-16 03:02:58 -07:00
Isotr0py
390ec88905
[Misc] Consolidate Audio tests into multimodal common generation tests ( #18214 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-16 09:18:08 +00:00
Seiji Eicher
541817670c
[Misc] Add Ray Prometheus logger to V1 ( #17925 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-05-16 01:02:42 -07:00
Vadim Gimpelson
67da5720d4
[PERF] Speed up Qwen2.5-VL model by speed up rotary position embedding ( #17973 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-05-15 23:31:02 -07:00
David Xia
5c04bb8b86
[doc] fix multimodal example script ( #18089 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-16 06:05:34 +00:00
Lucia Fang
3d2779c29a
[Feature] Support Pipeline Parallelism in torchrun SPMD offline inference for V1 ( #17827 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-05-15 22:28:27 -07:00
Will Eaton
6b31c84aff
Throw better error for when running into k8s service discovery issue ( #18209 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-15 21:07:28 -07:00
Harry Mellor
b18201fe06
Allow users to pass arbitrary JSON keys from CLI ( #18208 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 21:05:34 -07:00
Sky Lee
f4937a51c1
[Model] vLLM v1 supports Medusa ( #17956 )
...
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com >
Signed-off-by: skylee-01 <497627264@qq.com >
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com >
2025-05-15 21:05:31 -07:00
kliuae
ee659e3b60
[Bugfix][ROCm] Use chunked_prefill_paged_decode as fallback for V1 attention on ROCm ( #18093 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
2025-05-15 19:30:17 -07:00
Lucas Wilkinson
4e1c6a0264
[Bugfix] fix rotary embedding test for _get_padded_tensor_shape ( #18229 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-16 01:32:45 +00:00
Lucas Wilkinson
c7852a6d9b
[Build] Allow shipping PTX on a per-file basis ( #18155 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 16:41:55 -07:00
Lucia Fang
8795eb9975
[Bugfix] Fix test_eagle test ( #18223 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-05-15 15:59:42 -07:00
Alexei-V-Ivanov-AMD
0b34593017
Adding "AMD: Tensorizer Test" to amdproduction. ( #18216 )
2025-05-15 11:01:25 -07:00
Nicolò Lucchesi
e3f3aee6f4
[Misc] Avoid cuda graph log when sizes still match ( #18202 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-15 09:59:38 -07:00
TJian
92540529c0
[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 ( #18205 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-15 09:53:18 -07:00
Zhonghua Deng
fadb8d5c2d
[Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError ( #18181 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-05-15 09:01:47 -07:00
Sebastian Schoennenbeck
2aa5470ac5
[Frontend] Fix chat template content format detection ( #18190 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-05-15 09:00:21 -07:00
Harry Mellor
51ff154639
Improve examples rendering in docs and GitHub ( #18203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 15:57:49 +00:00
Alexei-V-Ivanov-AMD
566ec04c3d
Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3" in AMD Pipeline ( #18106 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-15 08:49:23 -07:00
Thomas Parnell
01c22335ba
[Kernel] [V1] Fix performance regression for triton unified attention ( #18161 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 06:39:00 -07:00
hustxiayang
451da4bcbd
add tools into TokenizeChatRequest ( #18187 )
...
Signed-off-by: yangxia <yangxiast@gmail.com >
2025-05-15 04:01:49 -07:00
Harry Mellor
07ad27121f
Update deprecated type hinting in model_loader ( #18130 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 04:00:21 -07:00
omahs
a9944aabfa
fix: typos ( #18151 )
...
Signed-off-by: omahs <73983677+omahs@users.noreply.github.com >
2025-05-15 02:16:15 -07:00
Russell Bryant
a8f5aec20a
[V1] Update zmq socket creation in nixl connector ( #18148 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 23:17:57 -07:00
David Xia
de71fec81b
[CI] don't skip fixed test_kv_cache_events() ( #18183 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-14 23:17:16 -07:00
Mengqing Cao
70f8b96724
[Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends ( #18178 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-14 23:16:31 -07:00
inkcherry
dd2a94596a
[Model] Allow the use of sliding window in Qwen2 ( #17772 )
...
Signed-off-by: inkcherry <mingzhi.liu@intel.com >
2025-05-14 22:29:38 -07:00
Ning Xie
420caf7557
[UT] Add ut for none hash ( #17892 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-15 13:28:11 +08:00
Chenheli Hua
4f07a64075
Support custom implementations of VideoLoader backends. ( #18091 )
2025-05-15 13:26:49 +08:00
Thomas Parnell
e6b8e65d2d
[Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 ( #18013 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 13:26:34 +08:00
Harry Mellor
26d0419309
Update deprecated type hinting in models ( #18132 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 22:06:50 -07:00
Luka Govedič
83f74c698f
[Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm ( #18154 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-05-14 22:04:43 -07:00
Reid
2dff093574
[Misc] add lobe-chat support ( #18177 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-15 05:02:23 +00:00
Aaron Pham
afe3236e90
[Chore] astral's ty ( #18116 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-15 05:00:43 +00:00
Mark McLoughlin
65334ef3b9
[V1][Metrics] Remove unused code ( #18158 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-14 20:13:17 -07:00
Chen Zhang
e60f550b38
[v1] Support multiple KV cache groups in GPU model runner ( #17945 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-14 18:54:54 -07:00
David Xia
f25e0d1125
[Bugfix]: make most of test_openai_schema.py pass ( #17664 )
2025-05-14 17:04:35 -07:00
Andrey Talman
09f106a91e
Upload vllm index for the rc builds ( #18173 )
2025-05-14 16:35:56 -07:00
Michael Goin
2142035b51
[V1] Support multiple kv connectors ( #17564 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-05-14 16:28:02 -07:00
Russell Bryant
78aa341d12
[CI] Fix race condition in test_kv_cache_events test ( #18169 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 16:27:48 -07:00
Jerry Zhang
7974736740
Add support for loading torchao models with AOPerModuleConfig ( #17826 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-05-14 16:24:59 -07:00
Aaron Pham
2fc9075b82
[V1] Structured Outputs + Thinking compatibility ( #16577 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 15:45:24 -07:00
Lucas Wilkinson
d93c976a0d
[Kernel] Have rotary embeddings support tensors ( #18046 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-14 15:43:55 -07:00
David Xia
749f792553
[Frontend] decrease import time of vllm.multimodal ( #18031 )
...
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-05-14 15:43:32 -07:00
Robert Shaw
856865008e
[CI] Disable Failing Tests ( #18165 )
2025-05-14 13:49:56 -07:00
bnellnm
f9c069c85e
Modularize fused experts and integrate PPLX kernels ( #15956 )
2025-05-14 13:11:54 -07:00
Ekagra Ranjan
418d2f8bfb
[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model ( #17326 )
...
Co-authored-by: root <root@ekagra-8xh100.us-east5-a.c.serving-efficiency-poc.internal>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-14 12:31:46 -07:00
Chen Zhang
964472b966
[Doc] Update prefix cache metrics to counting tokens ( #18138 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-14 15:23:30 +00:00
Nick Hill
59dd311cf5
[KVConnector] Keep KVTransferParams as a dict ( #18033 )
2025-05-14 08:05:57 -07:00
Cyrus Leung
d066e52013
[Bugfix] Fix chat utils tests ( #18139 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 05:38:21 -07:00
Harry Mellor
c8ea982d9b
Update deprecated type hinting in platform, plugins, triton_utils, vllm_flash_attn ( #18129 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 05:28:16 -07:00
Harry Mellor
dc372b9c8a
Update deprecated type hinting in vllm/device_allocator and vllm/distributed ( #18126 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 04:07:57 -07:00
Harry Mellor
9b5b39b650
Update deprecated type hinting in vllm/lora ( #18128 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 03:57:59 -07:00
Reid
9ccc6ded42
[doc] add missing import ( #18133 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-14 10:57:34 +00:00
Cyrus Leung
d62a076e84
[Model] GritLM supports other attention backends ( #18109 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 03:33:19 -07:00
Jee Jee Li
259127f8b8
[Bugfix] Fix LoRA test ( #18123 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-14 10:25:47 +00:00
TJian
612c2edb4f
[FEAT] [ROCm]: Add AITER CK 2 Stages MoE support ( #17110 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-14 03:03:11 -07:00
Andrzej Kotłowski
38fe728d60
[Bugfix] Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile ( #17844 )
...
Signed-off-by: Andrzej Kotłowski <akotlowski@habana.ai >
2025-05-14 09:39:51 +00:00
rongfu.leng
82e7f9bb03
[Misc] replace does not exist model ( #18119 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-14 02:13:47 -07:00
Jee Jee Li
63dc3426e0
[Model] Add packed_modules_mapping for Qwen3-MOE ( #18118 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-14 02:13:19 -07:00
Cyrus Leung
8f5dc41481
[Bugfix] Fix entrypoints audio test failure ( #18111 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 09:08:07 +00:00
wang.yuqi
63ad622233
[New Model]: support GTE NewModel ( #17986 )
2025-05-14 01:31:31 -07:00
majianpeng
e7ef61c1f0
[Bugfix][Example] make lmcache v0 work. ( #18051 )
...
Signed-off-by: Ma, Jianpeng <jianpeng.ma@intel.com >
2025-05-13 23:43:44 -07:00
Jinzhen Lin
d4154c35a2
[Bugfix] fix moe marlin topk_weight loading ( #18080 )
...
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-13 23:31:57 -07:00
lkchen
6685890d11
[Fix] Move "model_config" as keyword args in chat_utils.py ( #18098 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-13 23:27:26 -07:00
Ecthlion_zyy
33011318c2
Fix broken example: examples/offline_inference/profiling at scheduler_config ( #18117 )
2025-05-13 23:19:14 -07:00
qli88
4f8b373225
[BugFix][AMD] Compatible patch for AITER lib after 04/20 ( #17912 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-05-13 23:05:20 -07:00
Charlie Fu
7b2f28deba
[AMD][torch.compile] Enable silu+fp8_quant fusion for rocm ( #18082 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-05-13 22:13:56 -07:00
vllmellm
2d912fb66f
[FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 ( #17955 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-13 22:03:47 -07:00
Michael Goin
12e6c0b41c
[Bugfix][V1] Fix FlashInfer V1 backend using the wrong VllmConfig ( #18086 )
2025-05-13 20:36:17 -07:00
Michael Goin
9a2a6357de
[Bugfix] Fix FP8 Marlin MoE and enable for compressed-tensors models ( #18026 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 19:48:33 -07:00
youkaichao
6266c57bae
[core][distributed] add ep group and all2all interface ( #18077 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-14 10:46:49 +08:00
Jon Gill
754b699cbe
[Bug]: Fix S3 model/tokenizer path resolution ( #18083 )
...
Signed-off-by: Jon Gill <jon@yurts.ai >
2025-05-13 19:34:17 -07:00
Roger Wang
6e27c6d86b
[Misc] Remove unused numpy tensor ( #18084 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-05-13 19:33:40 -07:00
Nick Hill
d5af47a149
[P/D] Add some more debug logs to NixlConnector ( #18102 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-13 19:33:03 -07:00
Pavani Majety
65f0f74b66
[Hardware/NVIDIA/Modelopt] Fix modelopt forward method for v1 torch.compile ( #18101 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-05-13 19:33:00 -07:00
Luka Govedič
176a95c670
[Fix] Support CUDAGraph capture for encoder-decoder on ROCm ( #18104 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-05-13 19:31:42 -07:00
Chen Zhang
f2ae883b67
[v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager ( #18001 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-13 19:09:39 -07:00
vllmellm
40de1ef455
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature ( #14968 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-13 19:08:20 -07:00
Russell Bryant
0189a65a2e
[Docs] Expand security doc with firewall info ( #18081 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-13 19:36:00 +00:00
Nick Hill
55aa7af994
[V1] DP scale-out (2/N): Decouple engine process management and comms ( #15977 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-13 10:48:21 -07:00
Harry Mellor
0b217da646
Update deprecated type hinting in vllm/adapter_commons ( #18073 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 08:32:51 -07:00
Harry Mellor
19324d660c
Update deprecated type hinting in vllm/compilation ( #18072 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 08:32:48 -07:00
Harry Mellor
fc407a1425
Give auto-merge label workflow permission to add labels to issues ( #18078 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 07:53:13 -07:00
Harry Mellor
009d9e7590
Convert benchmarks to ruff format ( #18068 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 13:43:29 +00:00
Cyrus Leung
b922c2ebd2
[Bugfix] Fix entrypoints metrics tests ( #18063 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-13 06:42:43 -07:00
Russell Bryant
00b14e0f16
[CI] set token permissions for pre-commit CI job ( #17729 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 13:38:30 +00:00
Russell Bryant
54e467e6f8
[CI] Add token permissions for add-ready-label CI job ( #17730 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 13:38:13 +00:00
Russell Bryant
79a1d25bbd
[CI] Add workflow permissions for helm CI job ( #17727 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 12:49:07 +00:00
Russell Bryant
9944011b30
[CI] Set token permissions for reminder comment CI job ( #17728 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 12:46:58 +00:00
Harry Mellor
8c946cecca
Update deprecated type hinting in vllm/transformers_utils ( #18058 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:34:37 -07:00
Harry Mellor
ff334ca1cd
Update deprecated type hinting in vllm/profiler ( #18057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:34:34 -07:00
Harry Mellor
6223dd8114
Update deprecated type hinting in model_executor/layers ( #18056 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:17:23 -07:00
Reid
906f0598fc
[doc] add download/list/delete HF model CLI usage ( #17940 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-13 11:15:51 +00:00
Aaron Pham
cb528d0585
[Fix] check to make sure processor has chat templates ( #18047 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-13 03:04:10 -07:00
Harry Mellor
98fcba1575
Convert .buildkite to ruff format ( #17656 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 09:28:31 +00:00
Russell Bryant
23b3134eb5
[Benchmarks] Refactor run_structured_output_benchmarks.sh ( #17722 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-13 01:47:29 -07:00
Michael Goin
ea6ae8cb45
[Bugfix] Fix marlin moe fallback logic for llama4 ( #18042 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 07:53:28 +00:00
Woosuk Kwon
2ff297dce9
[BugFix] Set default random seed to 0 for V1 ( #17929 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-13 07:52:19 +00:00
Jin Huang
8dd0671bac
[Bugfix][V1] Only get input embeddings w/ multi-modal models if first PP ( #17916 )
...
Signed-off-by: Jin Huang <jinhun@amazon.com >
Co-authored-by: Jin Huang <jinhun@amazon.com >
2025-05-13 15:10:07 +08:00
Chen Zhang
f0d610a8ae
[v1][KVCacheManager] Avoid full cache hit by controlling max_length ( #17999 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-13 06:50:38 +00:00
Driss Guessous
e57e4d6e9e
Fix Broken macro for cutlass moe ( #18049 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-05-12 23:31:06 -07:00
Nick Hill
ee5be834e7
[BugFix] Fix 4-GPU RLHF tests ( #18007 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-12 23:03:55 -07:00
Calvin Chen
48545728d8
cleanup invalid prints ( #18050 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-12 23:01:57 -07:00
Chauncey
dc1a821768
[Feature][V1] Support tool_choice: required when using Xgrammar as the StructuredOutputBackend. ( #17845 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-12 23:01:31 -07:00
Cyrus Leung
61e0a506a3
[Bugfix] Avoid repeatedly creating dummy data during engine startup ( #17935 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-12 22:40:19 -07:00
Michael Goin
1df491c522
[Bugfix] Fixes for new marlin moe usage ( #18017 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 03:50:04 +00:00
Arjun Kathuria
d8487ef557
[ROCm]: Fix build from source failure with gcc14 and ROCm 6.3 ( #13779 )
...
Signed-off-by: Arjun Kathuria <arjun.kathuria8@gmail.com >
2025-05-12 20:36:33 -07:00
Jee Jee Li
c06af9a959
[Misc] Slight spelling modification ( #18039 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-12 20:36:27 -07:00
Tao He
60f7624334
Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support ( #11844 )
2025-05-12 19:52:47 -07:00
hissu-hyvarinen
f6518b2b48
[ROCm] Skip tests for quantizations incompatible with ROCm ( #17905 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2025-05-12 18:39:28 -06:00
Harry Mellor
d67085c2c8
Remove noisy warnings from SchedulerConfig ( #17995 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 00:33:45 +00:00
Michael Goin
307939f299
Use NVFP4 Marlin for CompressedTensorsW4A16Fp4 ( #18000 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2025-05-12 18:07:34 -06:00
Harry Mellor
9d7ea9dbbf
Update some more deprecated type hinting ( #17998 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-12 23:49:33 +00:00
bwshen-mi
acee8f48aa
[Model] Support MiMo-7B inference with MTP ( #17433 )
...
Signed-off-by: wp-alpha <wangpeng66@xiaomi.com >
Co-authored-by: wangpeng66 <wangpeng66@xiaomi.com >
2025-05-12 23:25:33 +00:00
Michael Goin
f065de4e88
Fix FBGEMM integration ( #18002 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-12 23:02:07 +00:00
wwl2755
dc9905368d
[V1][Spec Decode] Eagle unit tests ( #17350 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-12 23:01:17 +00:00
Russell Bryant
ebab1ac37c
[CI] Make JSON output tests less likely to fail ( #17859 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-12 22:31:54 +00:00
Yang Wang
2b0db9b0e2
Enable standard language model for torhc nightly ( #18004 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-12 14:00:04 -07:00
Robert Shaw
195adb47c0
[Chore] Remove unused method ( #18024 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-12 13:59:47 -07:00
Chen Zhang
302f3aca7e
[v1][KVCacheManager] Change prefix caching metric from counting blocks to counting tokens ( #18003 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-12 13:46:12 -07:00
Alexei-V-Ivanov-AMD
e9c730c9bd
Enabling "Weight Loading Multiple GPU Test - Large Models" ( #18020 )
2025-05-12 13:05:33 -07:00
Jade Zheng
289199feb6
[Core] Use platform-agnostic device control for DP engine core ( #17245 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-05-12 12:09:16 -07:00
Carol Zheng
b9fd0d7a69
[CI/Build] Fix TPU V1 Test mixed use of & and && across tests ( #17968 )
2025-05-12 12:06:59 -07:00
Harry Mellor
72a3f6b898
Construct KVTransferConfig properly from Python instead of using JSON blobs without CLI ( #17994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-12 11:25:33 -07:00
Jonathan Berkhahn
98ea35601c
[Lora][Frontend]Add default local directory LoRA resolver plugin. ( #16855 )
...
Signed-off-by: jberkhahn <jaberkha@us.ibm.com >
2025-05-12 10:39:10 -07:00
Robert Shaw
d19110204c
[P/D] NIXL Integration ( #17751 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Brent Salisbury <bsalisbu@redhat.com >
2025-05-12 09:46:16 -07:00
Maximilien de Bayser
05a4324f8e
Initialize the delta tool call fields explicitly ( #17340 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: igmainc <igmainc@icloud.com >
2025-05-12 13:28:58 +00:00
Jee Jee Li
7ea6cb28b2
[Misc] Improve modelscope import error ( #17983 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-12 10:46:45 +00:00
Aaruni Aggarwal
9fbf2bfbd5
Correcting testcases in builkite job for IBM Power ( #17675 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-05-12 08:11:55 +00:00
Xu Wenqing
3a5ea75129
[Feature] Support DeepSeekV3 Function Call ( #17784 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-05-12 00:45:21 -07:00
Brayden Zhong
891b9d33de
[Fix] Benchmark "EngineClient" has no attribute "model_config" ( #17976 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-05-11 22:55:53 -07:00
Siyuan Liu
430783018c
[Bugfix][TPU] Use np array when updating cache slot_mapping ( #17971 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-05-12 12:58:33 +08:00
Li Wang
19a3c78d1f
[Bugfix] Fix pydantic.errors.PydanticUserError ( #17962 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-05-12 12:58:23 +08:00
Reid
ada50aa295
[bugfix] fix the wrong parser ( #17958 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-12 04:58:02 +00:00
Cheng Kuan Yong Jason
08bf784078
[Bugfix] validate grammar and throw 400 error instead of crashing the engine when xgrammar validation fails ( #17623 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-12 09:06:10 +08:00
youkaichao
d45fe333fb
[misc] add instructions on how to install nvshmem/pplx/deepep ( #17964 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-11 18:02:39 -07:00
Isotr0py
021c16c7ca
[Model] Broadcast Ovis2 implementation to fit Ovis1.6 ( #17861 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-11 17:56:30 -07:00
TJian
7de18d541b
[BUG] [ROCm] [MLA] Fix variable name bug due to change in variable name in PR #17483 ( #17961 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-11 09:14:30 -07:00
TJian
a810b5b088
[BugFix] [ROCm]: Bugfix and handle addition case of input for rocm_aiter_rms_norm ( #17857 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-11 04:17:11 -07:00
Reid
009b3d5382
[Misc] not show --model in vllm serve --help ( #16691 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-11 08:47:58 +00:00
wang.yuqi
e4b8713380
[New Model]: nomic-embed-text-v2-moe ( #17785 )
2025-05-11 00:59:43 -07:00
Gregory Shtrasberg
06c0922a69
[FP8][ROCm][Attention] Enable FP8 KV cache on ROCm for V1 ( #17870 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-11 15:58:45 +08:00
Dipika Sikka
cd3edfc908
[Misc] Add compressed-tensors NVFP4A16 emulation support ( #17914 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-05-11 15:58:38 +08:00
Frieda Huang
9cea90eab4
[Frontend] Add /classify endpoint ( #17032 )
...
Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com >
2025-05-11 07:57:07 +00:00
Reid
d1110f5b5a
[doc] update lora doc ( #17936 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-11 15:56:21 +08:00
Ben Browning
8132365b74
[Bugfix]: v1 engine - consider lora adapters in allowed_token_ids ( #17855 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-05-11 00:53:58 -07:00
Shiyan Deng
eea22a56ab
fix amd triton mla path ( #17871 )
2025-05-11 07:53:31 +00:00
Kuntai Du
9112155283
[Perf] Use small max_num_batched_tokens for A100 ( #17885 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-05-11 07:53:23 +00:00
xinli-centml
90d0a74b60
[Bugfix] Add revision to transformers.Auto*.from_pretrained processors ( #17948 )
...
Signed-off-by: Xin Li <xin@centml.ai >
2025-05-11 07:52:44 +00:00
Jinzhen Lin
d74e5f37bc
[Kernel] fp4 marlin kernel ( #17687 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-05-10 19:58:49 -07:00
Chen Zhang
ca66a1674c
[v1] Rename specialized_manager.py to single_type_kv_cache_manager.py ( #17946 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-10 16:14:12 -07:00
Chen Zhang
950751a987
[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders ( #17483 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-10 16:12:04 -07:00
Reid
4c31218f80
[Misc] remove --model from vllm serve usage ( #17944 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-10 13:23:31 +00:00
Harry Mellor
68311891f5
Don't default construct ModelConfig when default constructing VllmConfig ( #17943 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-10 13:23:00 +00:00
Ximo Guanter
fc4441a4ee
Add missing content type headers to /ping and /health ( #17036 ) ( #17786 )
...
Signed-off-by: Ximo Guanter <ximo.guanter@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-10 07:13:32 +01:00
tracelogfb
246e3e0a36
fix broken test vllm:test_kernels - test_attention_selector.py::test_flash_attn ( #17873 )
...
Co-authored-by: Stephen Chen <tracelog@meta.com >
2025-05-10 10:46:54 +08:00
Mark McLoughlin
7042cc96b0
[V1][Spec Decoding] Log accumulated metrics after system goes idle ( #17913 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-09 18:23:07 -07:00
Pavani Majety
0c0fdae84f
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model ( #16362 )
2025-05-09 16:24:41 -07:00
Alexei-V-Ivanov-AMD
3b602cdea7
AMD conditional all test execution // new test groups ( #17556 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-05-09 15:35:58 -07:00
Harry Mellor
4b2ed7926a
Improve configs - the rest! ( #17562 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 15:18:44 -07:00
Mark McLoughlin
7e3571134f
[V1][Spec Decoding] Include bonus tokens in mean acceptance length ( #17908 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-09 13:32:36 -07:00
Richard Zou
ea2236bf95
Add option to use torch._inductor.standalone_compile ( #17057 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-09 12:59:04 -07:00
Harry Mellor
7d4aedae7c
Handle error when str passed to /v1/audio/transcriptions ( #17909 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 19:23:59 +00:00
Michael Goin
22481fbfa3
Update CT WNA16MarlinMoE integration ( #16666 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-09 13:19:45 -04:00
Isotr0py
5c4c08f6f1
[Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfloat16 config ( #17265 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-09 17:16:12 +00:00
Rui Qiao
c44c384b1c
[Misc] Add references in ray_serve_deepseek example ( #17907 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-05-09 16:59:36 +00:00
Michael Goin
85b72cb7b1
Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" ( #17910 )
2025-05-09 08:58:18 -07:00
Cyrus Leung
6e5595ca39
[CI/Build] Automatically retry flaky tests ( #17856 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-09 09:55:17 -06:00
Chen Zhang
200da9a517
[v1] Move block management logic from KVCacheManager to SpecializedManager ( #17474 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-09 15:25:34 +00:00
qli88
9f64e93415
[BugFix][AMD] Compatible patch for latest AITER(05/07/2025) ( #17864 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-05-09 08:59:36 -06:00
Reid
ec61ea20a8
[Misc] add dify integration ( #17895 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-09 03:42:39 -07:00
Harry Mellor
c6798baa9c
Change top_k to be disabled with 0 (still accept -1 for now) ( #17773 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 10:01:49 +00:00
inkcherry
5b2dcbf0b8
Fix Whisper crash caused by invalid ``max_num_batched_tokens`` config ( #17853 )
...
Signed-off-by: inkcherry <mingzhi.liu@intel.com >
2025-05-09 09:16:26 +00:00
Isotr0py
6e4a93e3f7
[Bugfix][CPU] Fix broken AVX2 CPU TP support ( #17252 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-09 08:55:14 +00:00
vllmellm
217db4baa6
[Bugfix][ROCm] Fix AITER MLA V1 ( #17880 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-09 08:38:21 +00:00
Yan Ma
ff8c400502
[Doc] remove visible token in doc ( #17884 )
...
Signed-off-by: yan <yanma1@habana.ai >
2025-05-09 01:21:31 -07:00
Michael Yao
89a0315f4c
[Doc] Update several links in reasoning_outputs.md ( #17846 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-09 01:20:55 -07:00
Simon Mo
3d1e387652
[Docs] Add Slides from NYC Meetup ( #17879 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-05-08 21:46:54 -07:00
Ning Xie
d310e6de98
[BUGFIX]: return fast when request requires prompt logprobs ( #17251 )
2025-05-08 21:25:41 -07:00
Lucas Wilkinson
5e6f939484
[Attention] MLA move rotary embedding to cuda-graph region ( #17668 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-09 11:14:42 +08:00
Shanshan Shen
760e3ecc8f
[V1][Structured Output] Update llguidance (>= 0.7.11) to avoid AttributeError (no StructTag) ( #17839 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-05-08 20:14:18 -07:00
vllmellm
3c9396a64f
[FEAT][ROCm]: Support AITER MLA on V1 Engine ( #17523 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
2025-05-09 10:42:05 +08:00
Shu Wang
376786fac1
Add cutlass support for blackwell fp8 blockwise gemm ( #14383 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-05-08 15:09:55 -07:00
Michael Goin
4f605a6de5
Fix noisy warning for uncalibrated q_scale/p_scale ( #17414 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-08 15:56:59 -04:00
Michael Goin
8342e3abd1
[CI] Prune down lm-eval small tests ( #17012 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-08 19:00:26 +00:00
yarongmu-google
a83a0f92b5
[Test] Attempt all TPU V1 tests, even if some of them fail. ( #17334 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-05-08 17:20:54 +00:00
Russell Bryant
226a4272cf
[V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging ( #17860 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-08 16:57:35 +00:00
Russell Bryant
ec54d73c31
[CI] Fix test_collective_rpc ( #17858 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-08 16:47:12 +00:00
Jee Jee Li
a944f8ede7
[Misc] Delete LoRA-related redundancy code ( #17841 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-08 06:02:21 -07:00
Cyrus Leung
015815fe01
[Bugfix] use_fast failing to be propagated to Qwen2-VL image processor ( #17838 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-08 05:39:21 -07:00
Harry Mellor
e4ca6e3a99
Fix transient dependency error in docs build ( #17848 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-08 03:42:03 -07:00
Reid
53d0cb7423
[Misc] add chatbox integration ( #17828 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-08 10:05:26 +00:00
Lu Fang
f50dcb7c21
[Easy] Eliminate c10::optional usage in vllm/csrc ( #17819 )
2025-05-08 03:05:10 -07:00
Cyrus Leung
a1e19b635d
[Doc] Fix a typo in the file name ( #17836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-08 18:04:18 +08:00
fxmarty-amd
bb239a730f
[Bugfix] Fix quark fp8 format loading on AMD GPUs ( #12612 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
Signed-off-by: kewang2 <kewang2@amd.com >
Co-authored-by: kewang2 <kewang2@amd.com >
2025-05-08 02:53:53 -07:00
Jevin Jiang
a463555dee
[TPU] Fix the test_sampler ( #17820 )
2025-05-08 05:51:33 -04:00
Rick Yuan
ca04b97c93
[Bugfix] Fix tool call template validation for Mistral models ( #17644 )
...
Signed-off-by: Rick Yuan <yuan821120@gmail.com >
Signed-off-by: RIck Yuan <yuan821120@gmail.com >
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-05-08 09:47:19 +00:00
xsank
0a9bbaa104
[Misc] support model prefix & add deepseek vl2 tiny fused moe config ( #17763 )
...
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com >
Co-authored-by: 唯勤 <xsank.mz@alibaba-inc.com >
2025-05-08 07:50:22 +00:00
Qiong Zhou Huang
39956efb3f
[Bugfix] Fix bad words for Mistral models ( #17753 )
...
Signed-off-by: Qiong Zhou Huang <qiong@phonic.co >
2025-05-07 23:32:10 -07:00
Ximingwang-09
597051e56f
[Qwen3]add qwen3-235b-bf16 fused moe config on A100 ( #17715 )
2025-05-07 23:09:32 -07:00
Cyrus Leung
96722aa81d
[Frontend] Chat template fallbacks for multimodal models ( #17805 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-07 23:05:54 -07:00
Agata Dobrzyniewicz
843b222723
[Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU ( #17648 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-05-07 22:37:03 -07:00
Akash kaothalkar
e515668edf
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER ( #17153 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-07 22:35:03 -07:00
Hashem Hashemi
5a499e70d5
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs ( #17071 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: charlifu <charlifu@amd.com >
2025-05-07 22:34:49 -07:00
Russell Bryant
6930a41116
[V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var ( #17490 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-05-08 13:34:02 +08:00
Harry Mellor
998eea4a0e
Only log non-default CLI args for online serving ( #17803 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 22:33:29 -07:00
Mikhail Podvitskii
c747d84576
[Installation] OpenTelemetry version update ( #17771 )
...
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com >
2025-05-07 22:32:49 -07:00
Vadim Markovtsev
b2da14a05a
Improve exception reporting in MP engine ( #17800 )
...
Signed-off-by: Vadim Markovtsev <vadim@poolside.ai >
2025-05-08 05:32:39 +00:00
Chanh Nguyen
7ea2adb802
[Core] Support full cuda graph in v1 ( #16072 )
...
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com >
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com >
2025-05-07 22:30:15 -07:00
Nick Hill
3d13ca0e24
[BugFix] Fix --disable-log-stats in V1 server mode ( #17600 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-08 04:08:15 +00:00
Harry Mellor
66ab3b13c9
Don't call the venv vllm ( #17810 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-08 04:06:39 +00:00
Aaron Pham
a8238bbdb0
[Chore][Doc] uses model id determined from OpenAI client ( #17815 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-08 01:48:57 +00:00
Wallas Henrique
d43f914d42
[Core][Feature] Input metadata dump on crash ( #13407 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2025-05-07 22:15:09 +00:00
Nick Hill
ed5272cf21
[BugFix] Avoid secondary missing MultiprocExecutor.workers error ( #17811 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-07 21:55:04 +00:00
Akshat Tripathi
c20ef40fd0
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend ( #14238 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-05-07 16:28:47 -04:00
Bowen Bao
db593aa67f
[Quantization] Quark MXFP4 format loading ( #16943 )
2025-05-07 15:05:05 -04:00
Isotr0py
f98e307588
[Bugfix] Fix missing lora name mapping for lora without prefix ( #17793 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 16:17:12 +00:00
Harry Mellor
646a31e51e
Fix and simplify deprecated=True CLI kwarg ( #17781 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 16:51:06 +01:00
Isotr0py
be8ff88e66
[Bugfix] Fix Video IO error for short video ( #17791 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 15:36:06 +00:00
Christian Heimes
1a6af1453d
Only depend on importlib-metadata for Python < 3.10 ( #17776 )
...
Signed-off-by: Christian Heimes <christian@python.org >
2025-05-07 07:51:06 -07:00
Gregory Shtrasberg
32aa74c09c
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention ( #17139 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-07 07:12:35 -07:00
Reid
7377dd0307
[doc] update the issue link ( #17782 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-07 20:29:05 +08:00
Yong Hoon Shin
98c89e16ff
Make key optional for rotary embedding ( #17566 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-07 00:11:46 -07:00
Yong Hoon Shin
324a3119b0
Fix test_memory_usage_no_spec ( #17754 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-07 00:10:33 -07:00
Cyrus Leung
8a15c2603a
[Frontend] Add missing chat templates for various MLLMs ( #17758 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-07 00:10:01 -07:00
Satyajith Chilappagari
043e4c4955
Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling ( #16357 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
Co-authored-by: Aaron Dou <yzdou@amazon.com >
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com >
Co-authored-by: Chongming Ni <chongmni@amazon.com >
Co-authored-by: Amulya Ballakur <amulyaab@amazon.com >
Co-authored-by: Patrick Lange <patlange@amazon.com >
Co-authored-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Lin Lin Pan <tailinpa@amazon.com >
Co-authored-by: Navyadhara Gogineni <navyadha@amazon.com >
Co-authored-by: Yishan McNabb <yishanm@amazon.com >
Co-authored-by: Mrinal Shukla <181322398+mrinalks@users.noreply.github.com >
2025-05-07 00:07:30 -07:00
Jee Jee Li
ba7703e659
[Misc] Remove qlora_adapter_name_or_path ( #17699 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-06 23:10:37 -07:00
Wanrui Dai
f80ae5bdcf
[Kernel] Use fused rmsnorm for some models like qwen3 series ( #17735 )
...
Signed-off-by: evian <eviantai@u.nus.edu >
Co-authored-by: evian <eviantai@u.nus.edu >
2025-05-06 23:10:02 -07:00
Szymon Ożóg
1a45a61387
[Kernel] GGUF MoeVec kernel ( #16780 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-06 23:07:23 -07:00
Isotr0py
c3e9d5060e
[Misc] Use apply_rotary_emb from vllm_flash_attn for Qwen2-VL vision RoPE ( #17726 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 04:51:33 +00:00
Jee Jee Li
822de7fb94
[Misc] Split model loader ( #17712 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-07 12:42:26 +08:00
Woosuk Kwon
8d84d836d1
[BugFix][Spec Decode] Fix hidden size mismatch between target and eagle head ( #17740 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-06 19:51:26 -07:00
Michael Goin
950b71186f
Replace lm-eval bash script with pytest and use enforce_eager for faster CI ( #17717 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 18:00:10 -07:00
Michael Goin
e50a1f1a9c
[TPU] Add kernel test for moe_pallas ( #17496 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-05-06 17:59:57 -07:00
Michael Goin
a17cef70ea
Removed unused marlin cuda code ( #17684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 17:59:47 -07:00
Chih-Chieh Yang
18dd5e01f2
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels ( #17146 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-05-06 17:59:30 -07:00
Yang Wang
6de3e13413
Add logging for torch nightly version ( #17669 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-07 00:45:51 +00:00
Hongxia Yang
ed3a1d2106
[ROCm] fix num_stages for default moe config to avoid triton OutOfResource error ( #17744 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-05-07 00:39:48 +00:00
Harry Mellor
022afbeb4e
Fix doc build performance ( #17748 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 00:36:41 +00:00
Thomas Parnell
2f925e5777
[Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode ( #16828 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-06 18:21:48 -04:00
Gregory Shtrasberg
de906b95f9
[Bugfix] Fix for the condition to accept empty encoder inputs for mllama ( #17732 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-06 19:59:06 +00:00
d.transposed
d456aea71f
[Misc] Add Next Edit Prediction (NEP) datasets support in benchmark_serving.py ( #16839 )
...
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Signed-off-by: dtransposed <>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
2025-05-06 15:38:45 -04:00
Jevin Jiang
621ca2c0ab
[TPU] Increase block size and reset block shapes ( #16458 )
2025-05-06 13:55:04 -04:00
Harry Mellor
6115b11582
Make right sidebar more readable in "Supported Models" ( #17723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-06 16:48:26 +00:00
Cyrus Leung
5b8c390747
[Bugfix] Fix modality limits in vision language example ( #17721 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-06 16:12:28 +00:00
Reid
7525d5f3d5
[doc] Add RAG Integration example ( #17692 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-06 16:10:23 +00:00
Chen Zhang
aabcd2cae3
[v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager ( #17479 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-06 08:50:34 -07:00
Michael Yao
0d115460a7
[Docs] Use gh-file to add links to tool_calling.md ( #17709 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-06 15:27:19 +00:00
Aaron Pham
175bda67a1
[Feat] Add deprecated=True to CLI args ( #17426 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-06 08:11:27 -07:00
Chen Zhang
cba31c47c4
[v1] AttentionMetadata for each layer ( #17394 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-06 07:58:37 -07:00
Li, Jiang
a6fed02068
[V1][PP] Support PP for MultiprocExecutor ( #14219 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-05-06 07:58:05 -07:00
Michael Goin
d419aa5dc4
[V1] Enable TPU V1 backend by default ( #17673 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 06:49:49 -07:00
Mengqing Cao
f9bc5a0693
[Bugfix] Fix triton import with local TritonPlaceholder ( #17446 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-06 17:53:09 +08:00
Harry Mellor
05e1f96419
Fix dockerfilegraph pre-commit hook ( #17698 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-06 08:56:48 +00:00
Lucas Wilkinson
6eae34533a
[Misc] Fix ScalarType float4 naming ( #17690 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-06 01:07:15 -07:00
Cyrus Leung
63ced7b43f
[Doc] Update notes for H2O-VL and Gemma3 ( #17219 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-06 07:51:02 +00:00
Mikhail Podvitskii
dc47ba32f8
[Bugfix] Fixed prompt length for random dataset ( #17408 )
...
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com >
2025-05-06 07:00:08 +00:00
Richard Zou
edbf2d609e
[easy] Fix logspam on PiecewiseBackend errors ( #17138 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-05 23:46:11 -07:00
Stan Wozniak
999328be0d
[Model] Add GraniteMoeHybrid 4.0 model ( #17497 )
...
Signed-off-by: Thomas Ortner <boh@zurich.ibm.com >
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Co-authored-by: Thomas Ortner <boh@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-05-06 12:00:31 +08:00
Michael Goin
98834fefaa
Update nm to rht in doc links + refine fp8 doc ( #17678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 00:41:14 +00:00
Varun Sundar Rabindranath
90bd2ae172
[Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument ( #17677 )
2025-05-05 17:34:29 -07:00
Nicolò Lucchesi
5941e0b7ea
[TPU][V1] Add support for top-logprobs ( #17072 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-05 14:20:15 -07:00
XiongfeiWei
9765940824
[TPU] Enable gemma3-27b with TP>1 on multi-chips. ( #17335 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-05-05 14:19:58 -07:00
Nick Hill
5ea5c514da
[BugFix] Increase timeout for startup failure test ( #17642 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-05 20:53:19 +00:00
Russell Bryant
d3efde8176
[Benchmarks] Remove invalid option under V1 engine ( #17651 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-05 16:30:22 -04:00
Thomas J. Fan
aea302be6c
Use git-path commit in hook ( #17616 )
...
Signed-off-by: Thomas J. Fan <thomasjpfan@gmail.com >
2025-05-05 17:55:32 +00:00
Isotr0py
cc05b90d86
[Doc] Fix broken cuda installation doc rendering ( #17654 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-05 17:52:40 +00:00
Jinzhen Lin
1d0c9d6b2d
[Kernel] some optimizations for dense marlin and moe marlin ( #16850 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-05-05 09:39:30 -07:00
Tyler Michael Smith
f62cad6431
[Build/CI] Upgrade CUTLASS to 3.9.2 ( #17641 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-04 19:23:17 -07:00
Chauncey
5394ad7387
[Bugfix] fix KeyError on top logprobs are special tokens ( #17637 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-04 19:22:35 -07:00
Tyler Michael Smith
68e1ee0072
[Bugfix][Easy] Fix whitespace in shm_broadcast.py logging ( #17635 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-04 19:20:19 -07:00
Cyrus Leung
2858830c39
[Bugfix] Prioritize dtype in root config before checking text config ( #17629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-04 12:43:05 +00:00
Harry Mellor
d6484ef3c3
Add full API docs and improve the UX of navigating them ( #17485 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-03 19:42:43 -07:00
Cyrus Leung
46fae69cf0
[Misc] V0 fallback for --enable-prompt-embeds ( #17615 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-03 22:59:24 +00:00
Isotr0py
f66f1e0fa3
[Bugfix] Fix broken Qwen2.5-omni tests ( #17613 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-03 17:08:14 +00:00
Cyrus Leung
887d7af882
[Core] Gate prompt_embeds behind a feature flag ( #17607 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-04 00:19:20 +08:00
Gregory Shtrasberg
a92842454c
[Bugfix][ROCm] Using device_type because on ROCm the API is still torch.cuda ( #17601 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-02 22:25:47 -07:00
Tyler Michael Smith
c8386fa61d
[Build/CI] Upgrade CUTLASS to 3.9.1 ( #17602 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-02 22:25:14 -07:00
Chenyaaang
87baebebd8
[Frontend][TPU] Add TPU default max-num-batched-tokens based on device name ( #17508 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-02 21:42:44 -07:00
rasmith
e3d0a1d190
[Quantizaton] [AMD] Add support for running DeepSeek int8 w8a8 MoE on ROCm ( #17558 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-05-02 21:41:10 -07:00
22quinn
d47b605eca
Update test requirements to CUDA 12.8 ( #17576 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-05-02 21:40:15 -07:00
Liangfu Chen
22c6f6397f
[Neuron][Build] Require setuptools >= 77.0.3 for PEP 639 ( #17603 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-05-03 02:41:59 +00:00
Kevin H. Luu
3ec97e2cc5
[release] Add command to clean up Docker containers/images in TPU release machine ( #17606 )
2025-05-02 18:54:34 -07:00
Eric Hartford
9b103a1d76
fix typo in logging ( #17605 )
2025-05-02 18:04:40 -07:00
Richard Zou
b90b0852e9
[easy] Print number of needed GPUs in skip message ( #17594 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-02 15:27:43 -07:00
Xiaodong Wang
9352cdb56d
[Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning ( #16263 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Lu Fang <lufang@fb.com >
2025-05-02 19:44:19 +00:00
Zhiyu
182f40ea8b
Add NVIDIA TensorRT Model Optimizer in vLLM documentation ( #17561 )
2025-05-02 11:36:46 -07:00
Caleb_Du
3e887d2e0c
permute/unpermute kernel for moe optimization ( #14568 )
...
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn >
2025-05-02 11:31:55 -07:00
Lucas Wilkinson
0f87d8f7b2
[BugFix][Attention] Fix sliding window attention in V1 giving incorrect results ( #17574 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-02 11:01:38 -07:00
Hui Liu
4c33d67321
[Bugfix] fix tmp_out and exp_sums dimensions ( #17438 )
...
Signed-off-by: Hui Liu <96135754+hliuca@users.noreply.github.com >
2025-05-02 16:44:07 +00:00
Cyrus Leung
cb234955df
[Misc] Clean up input processing ( #17582 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 08:11:53 -07:00
Reid
3a500cd0b6
[doc] miss result ( #17589 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-02 07:04:49 -07:00
Michael Goin
868c546da4
Support W8A8 INT8 MoE for compressed-tensors ( #16745 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 10:03:32 -04:00
Cyrus Leung
99404f53c7
[Security] Fix image hash collision ( #17378 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 08:36:39 -04:00
Harry Mellor
785d75a03b
Automatically tell users that dict args must be valid JSON in CLI ( #17577 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-02 05:24:55 -07:00
Reid
6d1479ca4b
[doc] add the print result ( #17584 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-02 05:24:45 -07:00
Yang Wang
b8b0859b5c
add more pytorch related tests for torch nightly ( #17422 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-02 03:29:59 -07:00
Cyrus Leung
d7543862bd
[Misc] Rename assets for testing ( #17575 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 03:29:25 -07:00
Robert Shaw
c777df79f7
[BugFix] Fix Memory Leak ( #17567 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-02 01:07:03 -07:00
Andrew Sansom
cc2a77d7f1
[Core] [Bugfix] Add Input Embeddings ( #15428 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com >
Co-authored-by: Bryce1010 <bryceyx@gmail.com >
Co-authored-by: Nan2018 <nan@protopia.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 01:06:39 -07:00
Isotr0py
9e2de9b9e9
[Bugifx] Remove TritonPlaceholder from sys.modules ( #17317 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-02 00:45:01 -07:00
Jerry Zhang
109e15a335
Add pt_load_map_location to allow loading to cuda ( #16869 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-05-01 23:23:42 -07:00
Michael Goin
f192ca90e6
Fix PixtralHF missing spatial_merge_size ( #17571 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-01 22:14:09 -07:00
Cyrus Leung
f89d0e11bf
[Misc] Continue refactoring model tests ( #17573 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 22:06:08 -07:00
Michael Goin
b4003d11fc
Check if bitblas is installed during support check ( #17572 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 04:32:54 +00:00
Michael Goin
292fc59d61
[CI] Actually run tests/kv_transfer/test_disagg.py in CI ( #17555 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 04:05:04 +00:00
Lucas Wilkinson
afcb3f8863
[Attention] MLA move o_proj q_proj into cuda-graph region ( #17484 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-02 03:16:26 +00:00
David Xia
afb12e4294
[Doc] note that not all unit tests pass on CPU platforms ( #17554 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-02 02:57:21 +00:00
Michael Goin
24aebae177
[Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 ( #17541 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-01 17:59:35 -07:00
qizixi
39c0813a7f
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 ( #17504 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-01 16:19:30 -07:00
Chenyaaang
9b70e2b4c1
[Misc][Tools][Benchmark] Publish script to auto tune server parameters ( #17207 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-01 19:53:03 +00:00
Chen Xia
173daac19d
[Bug]change the position of cuda_graph_sizes in dataclasses ( #17548 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
2025-05-01 11:52:37 -07:00
sstamenk
04f2cfc894
Remove duplicate code from dbrx.py ( #17550 )
2025-05-01 11:51:58 -07:00
Juan Villamizar
811a6c0972
[ROCM] Add gfx950 to the custom attention archs ( #16034 )
...
Signed-off-by: jpvillam <Juan.Villamizar@amd.com >
Signed-off-by: seungrokjung <seungrok.jung@amd.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: seungrokjung <seungrok.jung@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-01 11:18:28 -07:00
Cyrus Leung
9b1769dd9a
[Bugfix] Fix lint error ( #17547 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 11:12:19 -07:00
Chen Xia
61c299f81f
[Misc]add configurable cuda graph size ( #17201 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 11:04:50 -07:00
Hongxia Yang
4acfa3354a
[ROCm] update installation guide to include build aiter from source instructions ( #17542 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-01 11:01:28 -07:00
Isotr0py
88c8304104
[Model] Refactor Ovis2 to support original tokenizer ( #17537 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-01 11:00:53 -07:00
Harry Mellor
6768ff4a22
Move the last arguments in arg_utils.py to be in their final groups ( #17531 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 10:31:44 -07:00
Cyrus Leung
f2e7af9b86
[CI/Build] Remove awscli dependency ( #17532 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 09:20:54 -07:00
Reid
7423cf0a9b
[Misc] refactor example - cpu_offload_lmcache ( #17460 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-01 15:05:24 +00:00
Sage Moore
460a2b1100
[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations ( #10867 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-05-01 07:59:28 -07:00
Hongxia Yang
28566d73b3
[ROCm] remove unsupported archs from rocm triton flash-attention supported list ( #17536 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-05-01 07:54:25 -07:00
Chauncey
98060b001d
[Feature][Frontend]: Deprecate --enable-reasoning ( #17452 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-01 06:46:16 -07:00
TJian
f5a3c655b2
[FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config ( #17535 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-01 06:37:17 -07:00
Reid
7169f87ad0
[doc] add streamlit integration ( #17522 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-01 13:34:02 +00:00
Huy Do
b74d888c63
Fix more broken speculative decode tests ( #17450 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-05-01 06:05:58 -07:00
TJian
2007d4d54f
[FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X ( #17530 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-01 06:03:13 -07:00
Cyrus Leung
48e925fab5
[Misc] Clean up test docstrings and names ( #17521 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 05:19:32 -07:00
Cyrus Leung
1903c0b8a3
[Frontend] Show progress bar for adding requests ( #17525 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 05:15:32 -07:00
Teruaki Ishizaki
86a1f67a3b
[Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model ( #17285 )
...
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com >
2025-05-01 11:54:51 +00:00
Harry Mellor
a257d9bccc
Improve configs - ObservabilityConfig ( #17453 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 03:52:05 -07:00
Chauncey
015069b017
[Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content ( #17515 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-01 03:29:01 -07:00
Russell Bryant
fbefc8a78d
[Core] Enable IPv6 with vllm.utils.make_zmq_socket() ( #16506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-01 09:38:18 +00:00
Keyun Tong
26bc4bbcd8
Avoid overwriting vllm_compile_cache.py ( #17418 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-05-01 07:30:57 +00:00
Lucas Wilkinson
3c3d767201
[BugFix] Fix mla cpu - missing 3 required positional arguments ( #17494 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-01 14:36:52 +08:00
Noah Yoshida
13cf6b6236
[BugFix] fix speculative decoding memory leak when speculation is disabled ( #15506 )
...
Signed-off-by: Noah Yoshida <noahcy117@gmail.com >
2025-04-30 23:28:17 -07:00
Hongxia Yang
90d0a54c4d
[ROCm] Effort to reduce the number of environment variables in command line ( #17229 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-04-30 23:27:06 -07:00
Russell Bryant
7a0a146c54
[Build] Require setuptools >= 77.0.3 for PEP 639 ( #17389 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-30 23:25:36 -07:00
Alexei-V-Ivanov-AMD
7ab643e425
FIxing the AMD test failures caused by PR#16457 ( #17511 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-30 23:23:07 -07:00
Cyrus Leung
afb4429b4f
[CI/Build] Reorganize models tests ( #17459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-30 23:03:08 -07:00
Michael Goin
aa4502e7f3
[CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg ( #17500 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 21:03:30 -07:00
Michael Goin
17b4d85f63
[CI][TPU] Skip structured outputs+spec decode tests on TPU ( #17510 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 20:36:20 -07:00
NaLan ZeYu
1144a8efe7
[Bugfix] Temporarily disable gptq_bitblas on ROCm ( #17411 )
...
Signed-off-by: Yan Cangang <nalanzeyu@gmail.com >
2025-04-30 19:51:45 -07:00
Gregory Shtrasberg
08fb5587b4
[Bugfix][ROCm] Fix import error on ROCm ( #17495 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-30 19:51:42 -07:00
Siyuan Liu
dbc18e7816
[CI][TPU] Skip Multimodal test ( #17488 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-04-30 19:51:39 -07:00
Alex Brooks
02bd654846
[Misc] Rename Audios -> Audio in Qwen2audio Processing ( #17507 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-30 19:51:36 -07:00
Rahul Tuli
200bbf92e8
Bump Compressed Tensors version to 0.9.4 ( #17478 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-30 15:24:45 -07:00
Chen Zhang
81ecf425f0
[v1][Spec Decode] Make sliding window compatible with eagle prefix caching ( #17398 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-30 18:25:53 +00:00
David Xia
42d9a2c4c7
doc: fix bug report Github template formatting ( #17486 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-30 10:03:20 -07:00
Reid
2ac74d098e
[doc] add install tips ( #17373 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-30 17:02:41 +00:00
Gregory Shtrasberg
584f5fb4c6
[Bugfix][ROCm] Restrict ray version due to a breaking release ( #17480 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-30 09:59:06 -07:00
zh Wang
d586ddc691
[BugFix] Fix authorization of openai_transcription_client.py ( #17321 )
...
Signed-off-by: zh Wang <rekind133@outlook.com >
2025-04-30 09:51:05 -07:00
Michael Goin
0b7e701dd4
[Docs] Update optimization.md doc ( #17482 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 09:34:02 -07:00
Russell Bryant
947f2f5375
[V1] Allow turning off pickle fallback in vllm.v1.serial_utils ( #17427 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-30 16:10:54 +00:00
Pete Savage
739e03b344
[Bugfix] Fixed mistral tokenizer path when pointing to file ( #17457 )
...
Signed-off-by: Pete Savage <psavage@redhat.com >
2025-04-30 08:08:37 -07:00
Aaron Pham
da4e7687b5
[Fix] Support passing args to logger ( #17425 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-30 08:06:58 -07:00
Russell Bryant
39317cf42b
[Docs] Add command for running mypy tests from CI ( #17475 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-30 08:06:09 -07:00
Chauncey
2990cee95b
[Feature] The Qwen3 reasoning parser supports guided decoding ( #17466 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 07:48:21 -07:00
Alec
0be6d05b5e
[V1][Metrics] add support for kv event publishing ( #16750 )
...
Signed-off-by: alec-flowers <aflowers@nvidia.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-04-30 07:44:45 -07:00
Marko Rosenmueller
77073c77bc
[Core] Prevent side-channel attacks via cache salting ( #17045 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-04-30 20:27:21 +08:00
Nicolò Lucchesi
a7d5b016bd
[TPU][V1][CI] Update regression test baseline for v6 CI ( #17064 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-30 04:03:22 -07:00
rongfu.leng
d803786731
[V1][Bugfix]: vllm v1 verison metric num_gpu_blocks is None ( #15755 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-30 18:20:39 +08:00
Chauncey
1534d389af
[Misc] Remove deprecated files ( #17447 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 01:52:19 -07:00
Lu Fang
ece5a8b0b6
Make the _apply_rotary_emb compatible with dynamo ( #17435 )
2025-04-30 07:52:48 +00:00
Marco
54072f315f
[MODEL ADDITION] Ovis2 Model Addition ( #15826 )
...
Signed-off-by: Marco <121761685+mlinmg@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-04-30 07:33:29 +00:00
Chauncey
be633fba0f
[Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' ( #17434 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 00:11:04 -07:00
Kunshang Ji
ed6cfb90c8
[Hardware][Intel GPU] Upgrade to torch 2.7 ( #17444 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com >
2025-04-30 00:03:58 -07:00
Kunshang Ji
6ed9f6047e
[Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue ( #17298 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-04-29 22:54:10 -07:00
Michael Goin
a44c4f1d2f
Support LoRA for Mistral3 ( #17428 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-29 21:10:30 -07:00
Huy Do
88fcf00dda
Fix some speculative decode tests with tl.dot ( #17371 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-04-29 19:41:02 -07:00
Harry Mellor
d1f569b1b9
Fix call to logger.info_once ( #17416 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:39:18 -07:00
Harry Mellor
13698db634
Improve configs - ModelConfig ( #17130 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-30 10:38:22 +08:00
Huy Do
2c4f59afc3
Update PyTorch to 2.7.0 ( #16859 )
2025-04-29 19:08:04 -07:00
Gabriel Marinho
1c2bc7ead0
Truncation control for embedding models ( #14776 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-04-30 09:24:57 +08:00
Kevin H. Luu
4055130a85
[release] Always git fetch all to get latest tag on TPU release ( #17322 )
2025-04-29 17:52:11 -07:00
Benjamin Chislett
34120f5acd
[V1][Feature] Enable Speculative Decoding with Structured Outputs ( #14702 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
2025-04-30 00:02:10 +00:00
Harry Mellor
7489ec0bab
Remove Bamba 9B from CI ( #17407 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 21:10:31 +00:00
Bryan Lu
70788bdbdc
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE ( #17211 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-29 21:10:00 +00:00
Dilip Gowda Bhagavan
c9c1b59e59
Fix: Python package installation for opentelmetry ( #17049 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
2025-04-29 20:20:24 +00:00
Harry Mellor
0350809f3a
Remove Falcon3 2x7B from CI ( #17404 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:52:25 +00:00
Harry Mellor
a6977dbd15
Simplify (and fix) passing of guided decoding backend options ( #17008 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:02:23 +00:00
Isotr0py
2fa2a50bf9
[Bugfix] Fix Minicpm-O-int4 GPTQ model inference ( #17397 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-29 18:21:42 +00:00
Reid
08e15defa9
[CI/Build] Add retry mechanism for add-apt-repository ( #17107 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-29 10:40:52 -07:00
Aaron Pham
b37685afbb
[CI] Uses Python 3.11 for TPU ( #17359 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-29 17:39:16 +00:00
Nicolò Lucchesi
792595b59d
[TPU][V1][CI] Replace python3 setup.py develop with standard pip install --e on TPU ( #17374 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-29 10:36:48 -07:00
casinca
0c1c788312
[Doc][Typo] Fixing label in new model requests link in overview.md ( #17400 )
2025-04-29 10:29:48 -07:00
Russell Bryant
56d64fbe30
[Docs] Propose a deprecation policy for the project ( #17063 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-29 10:29:44 -07:00
Alexei-V-Ivanov-AMD
608968b7c5
Enabling multi-group kernel tests. ( #17115 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-29 10:27:27 -07:00
TY-AMD
06ffc7e1d3
[Misc][ROCm] Exclude cutlass_mla_decode for ROCm build ( #17289 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-04-29 10:26:42 -07:00
Qiming Zhang
d3cf61b89b
fix gemma3 results all zero ( #17364 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
2025-04-29 09:40:25 -07:00
mofanke
a39203f99e
[Bugfix] add qwen3 reasoning-parser fix content is None when disable … ( #17369 )
...
Signed-off-by: mofanke <mofanke@gmail.com >
2025-04-29 16:32:40 +00:00
Chen Zhang
24e6ad3f16
[V1] Remove num_input_tokens from attn_metadata ( #17193 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-29 09:28:41 -07:00
Harry Mellor
2ef5d106bb
Improve literal dataclass field conversion to argparse argument ( #17391 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 16:25:08 +00:00
a2q1p
0ed27ef66c
Fix: Spelling of inference ( #17387 )
2025-04-29 09:23:39 -07:00
Harry Mellor
900edfa8d4
Transformers backend tweaks ( #17365 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 09:08:03 -07:00
Cyrus Leung
88ad9ec6b2
[Frontend] Support chat_template_kwargs in LLM.chat ( #17356 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 22:03:35 +08:00
Harry Mellor
40896bdf3f
pre-commit autoupdate ( #17380 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 06:46:55 -07:00
Cyrus Leung
00ee37efa2
[Bugfix] Clean up MiniMax-VL and fix processing ( #17354 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 20:42:16 +08:00
Jee Jee Li
890f104cdf
[Doc] Fix QWen3MOE info ( #17381 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-29 12:38:32 +00:00
Harry Mellor
4a5e13149a
Update docs requirements ( #17379 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 11:35:47 +00:00
Ekagra Ranjan
97cc8729f0
[Model] Ignore rotary embed load for Cohere model ( #17319 )
2025-04-29 00:30:40 -07:00
Gregory Shtrasberg
4464109219
[Build][Bugfix] Restrict setuptools version to <80 ( #17320 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-29 00:17:23 -07:00
Hyogeun Oh (오효근)
193e78e35d
[Fix] Documentation spacing in compilation config help text ( #17342 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-04-29 00:16:17 -07:00
ponix-j
bdb2cddafc
[Misc]Use a platform independent interface to obtain the device attributes ( #17100 )
2025-04-29 06:59:13 +00:00
Cyrus Leung
ebb3930d28
[Misc] Move config fields to MultiModalConfig ( #17343 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 06:37:21 +00:00
qscqesze
cde384cd92
[Model] support MiniMax-VL-01 model ( #16328 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-04-29 12:05:50 +08:00
Chauncey
96e06e3cb7
[Misc] Add a Jinja template to support Mistral3 function calling ( #17195 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-28 19:53:44 -07:00
Zhengyuan Su (苏政渊)
17eb306fcc
[Bugfix] Add contiguous call inside rope kernel wrapper ( #17091 )
...
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn >
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn >
2025-04-28 19:24:07 -07:00
Richard Zou
165cb56329
Ignore '<string>' filepath ( #17330 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-28 19:23:29 -07:00
Richard Barnes
d6da8a8ff2
[Bugfix] Fix numel() downcast in fused_layernorm_dynamic_per_token_quant.cu ( #17316 )
2025-04-28 19:23:18 -07:00
Lucia Fang
b4ac4fa04d
[model] make llama4 compatible with pure dense layers ( #17315 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-04-29 10:22:22 +08:00
Ekagra Ranjan
e136000595
[V1][Spec Decode] Make Eagle model arch config driven ( #17323 )
2025-04-29 10:22:02 +08:00
Michał Moskal
86d9fc29cb
implement Structural Tag with Guidance backend ( #17333 )
...
Signed-off-by: Michal Moskal <michal@moskal.me >
2025-04-29 02:21:32 +00:00
Cyrus Leung
506475de5f
[Optim] Compute multimodal hash only once per item ( #17314 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 09:40:35 +08:00
Ekagra Ranjan
cfe4532093
[Benchmark] Add single turn MTBench to Serving Bench ( #17202 )
2025-04-28 16:46:15 -07:00
Michael Goin
8fc88d63f1
[Model] Add tuned triton fused_moe configs for Qwen3Moe ( #17328 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-28 15:20:24 -07:00
Alex Wu
6e74fd4945
Support loading transformers models with named parameters ( #16868 )
...
Signed-off-by: Alex <alexwu@character.ai >
2025-04-28 23:15:58 +01:00
Simon Mo
dcbac4cb4b
[Model] Qwen3 Dense FP8 Compat Fixes ( #17318 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
2025-04-28 14:12:01 -07:00
Charlie Fu
ed2462030f
[Bugfix] Fix moe weight losing all extra attrs after process_weights_after_loading. ( #16854 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-04-28 21:05:07 +00:00
Lucas Wilkinson
cc5befbced
[BugFix] Fix cascade attention - RuntimeError: scheduler_metadata must have shape (metadata_size) ( #17283 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-28 13:55:50 -07:00
Aaron Pham
2c89cd96a8
[Chore] cleanup license indicators in light of SPDX ( #17259 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-04-28 19:43:52 +00:00
Russell Bryant
a0304dc504
[Security] Don't bind tcp zmq socket to all interfaces ( #17197 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-28 10:08:20 -07:00
Harry Mellor
c7941cca18
Explicitly explain quant method override ordering and ensure all overrides are ordered ( #17256 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:55:31 +00:00
Harry Mellor
b6dd32aa07
Make name of compressed-tensors quant method consistent across vLLM ( #17255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:28:13 +00:00
Harry Mellor
f94886946e
Improve conversion from dataclass configs to argparse arguments ( #17303 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:22:12 +00:00
Russell Bryant
72dfe4c74f
[Docs] Add a security guide ( #17230 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-28 15:12:17 +00:00
Cyrus Leung
8b464d9660
[Misc] Clean up Qwen2.5-Omni code ( #17301 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 06:20:45 -07:00
Nicolò Lucchesi
889ebb2638
[Misc] Minor typo/grammar in platforms/interface.py ( #17307 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-28 05:45:42 -07:00
Reid
3ad986c28b
[doc] update wrong model id ( #17287 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-28 04:20:51 -07:00
Cyrus Leung
344e193b7d
[Bugfix] Add missing get_language_model to new MLLMs ( #17300 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 04:09:57 -07:00
Harry Mellor
fb1c933ade
Add missing class docstring for PromptAdapterConfig ( #17302 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 04:06:59 -07:00
idouba
72c5b97231
Update tpu_worker.py 's typo ( #17288 )
2025-04-28 04:01:15 -07:00
Alex Brooks
fa93cd9f60
[Model] Add Granite Speech Support ( #16246 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-28 10:05:00 +00:00
Cyrus Leung
aec9674dbe
[Core] Remove legacy input mapper/processor from V0 ( #15686 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 15:38:48 +08:00
Wanrui Dai
7fcc4223dc
[Minor][Models] Pass partial_rotary_factor parameter to rope ( #17266 )
...
Signed-off-by: evian <eviantai@u.nus.edu >
Co-authored-by: evian <eviantai@u.nus.edu >
2025-04-28 04:28:59 +00:00
Nick Hill
8262a3e23b
[Misc] Validate stop_token_ids contents ( #17268 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-28 03:54:05 +00:00
Reid
f211331c48
[Doc] small fix ( #17277 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-28 03:53:35 +00:00
Kuntai Du
9053d0b134
[Doc] Fix wrong github link in LMCache examples ( #17274 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-04-28 03:09:11 +00:00
Michael Goin
cb3f2d8d10
[Bugfix] Fix Mistral3 spatial merge error ( #17270 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-27 19:40:05 -07:00
TherLF
c12df53b60
[Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c… ( #16751 )
...
Signed-off-by: Ther-LF <2639852836@qq.com >
2025-04-27 19:38:42 -07:00
Lennart K. M. Schulz
d1aeea7553
[Bugfix] Fix missing ARG in Dockerfile for arm64 platforms ( #17261 )
...
Signed-off-by: lkm-schulz <44176356+lkm-schulz@users.noreply.github.com >
2025-04-27 19:38:14 -07:00
Lucas Wilkinson
d8bccde686
[BugFix] Fix vllm_flash_attn install issues ( #17267 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-27 17:27:56 -07:00
Lily Liu
20e489eaa1
[V1][Spec Decode] Make eagle compatible with prefix caching. ( #17137 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-27 09:29:43 -07:00
Cyrus Leung
4213475ec7
[Metrics] Fix minor inconsistencies in bucket progression ( #17262 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-27 16:19:39 +00:00
Reid
d92879baf6
[doc] Add feature status legend ( #17257 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-27 08:17:02 -07:00
cascade
690fe019f0
[Feature] support sequence parallelism using compilation pass ( #16155 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-04-27 06:29:35 -07:00
Kaixi Hou
ed7a29d9f8
[NVIDIA] Support Cutlass MLA for Blackwell GPUs ( #16032 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
2025-04-27 06:29:21 -07:00
Alex Brooks
756848e79e
[Bugfix] Fix Lora Name Parsing ( #17196 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-27 20:33:09 +08:00
Flex Wang
18445edd0f
[Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096] to represent number of tokens ( #17033 )
...
Signed-off-by: sfc-gh-zhwang <flex.wang@snowflake.com >
2025-04-27 12:30:53 +00:00
Jade Zheng
30215ca61f
[MISC] Use string annotation types for class definitions ( #17244 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-04-27 08:39:57 +00:00
Chen Zhang
838cedade7
[Bugfix] Get a specific type of layer from forward context ( #17222 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-27 00:58:05 -07:00
Jee Jee Li
4283a28c2f
[Bugfix] Fix QWen2 VL multimodal mapping ( #17240 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-27 05:53:23 +00:00
Cyrus Leung
93a126fbc7
[Misc] Make cached tokenizer pickle-compatible ( #17048 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-27 13:05:00 +08:00
rasmith
8e4b351a0c
[Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel ( #12591 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-04-27 00:35:08 +00:00
Happy
9869453c42
Update test_flash_attn.py ( #17102 )
...
Signed-off-by: ShuaibinLi <lishuaibin@live.cn >
2025-04-26 22:17:35 +00:00
Reid
3642c59aa8
[CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh ( #16271 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-26 18:25:05 +00:00
Woosuk Kwon
43eea2953b
[Minor] Fix lint error in main branch ( #17233 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-26 11:10:14 -07:00
Kero Liang
de7eb10ce4
[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation ( #16878 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-04-26 10:41:35 -07:00
Ning Xie
fd11a325b8
[MISC] rename interval to max_recent_requests ( #14285 )
2025-04-26 16:59:18 +00:00
Lu Fang
4d17e20310
Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1 ( #16573 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-04-26 09:17:58 -07:00
changjun.lee
10fd1d7380
[Bugfix] fix error due to an uninitialized tokenizer when using skip_tokenizer_init with num_scheduler_steps ( #9276 )
...
Signed-off-by: changjun.lee <pord7457@gmail.com >
2025-04-26 11:51:17 -04:00
Russell Bryant
52b4f4a8d7
[Docs] Update structured output doc for V1 ( #17135 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-26 15:12:18 +00:00
Aaron Pham
e782e0a170
[Chore] added stubs for vllm_flash_attn during development mode ( #17228 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-26 07:45:26 -07:00
Ning Xie
dc2ceca5c5
[BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set ( #17088 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-04-26 14:34:24 +00:00
Russell Bryant
f8acd01ff7
[V1] Add structural_tag support using xgrammar ( #17085 )
2025-04-26 14:06:37 +00:00
Agata Dobrzyniewicz
c48334d405
[Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device ( #17186 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-04-26 05:55:14 -07:00
Cyrus Leung
909fdaf152
[Bugfix] Fix standard models tests ( #17217 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-26 02:26:41 -07:00
Isotr0py
8c1c926d00
[Bugfix] Fix missing int type for -n in multi-image example ( #17223 )
2025-04-26 08:49:52 +00:00
Nick Hill
df6f3ce883
[Core] Remove prompt string from engine core data structures ( #17214 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-25 23:41:05 -07:00
Woosuk Kwon
513f074766
[CI/test] Fix Eagle Correctness Test ( #17209 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 23:40:36 -07:00
Nick Hill
b07bf83c7d
[BugFix] Avoid race conditions in zero-copy tensor transmission ( #17203 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-26 06:00:07 +00:00
Zijing Liu
53e8cf53a4
[V1][Metrics] Allow V1 AsyncLLM to use custom logger ( #14661 )
...
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-25 22:05:40 -07:00
Charlie Fu
54271bb766
[ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. ( #17011 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-04-25 22:05:10 -07:00
Shu Wang
9e96f56efb
Allocate kv_cache with stride order ( #16605 )
...
Signed-off-by: shuw <shuw@nvidia.com >
2025-04-25 22:03:31 -07:00
Woosuk Kwon
b278911229
[Minor][Models] Fix Return Types of Llama & Eagle ( #17220 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 21:54:47 -07:00
yarongmu-google
7bd0c7745c
[Doc] Minor fix for the vLLM TPU setup page ( #17206 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-04-26 04:39:56 +00:00
Woosuk Kwon
1cf0719ebd
[Minor][Spec Decode] Add use_eagle to SpeculativeConfig ( #17213 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 21:08:15 -07:00
Reid
537d5ee025
[doc] add Anything LLM integration ( #17216 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-25 21:03:23 -07:00
Lu Fang
c8e5be35f7
[MISC][AMD] Add unused annotation to rocm kernel file ( #17097 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-04-25 20:33:35 -07:00
James Wu
a6e72e1e4f
[Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env ( #17142 )
...
Signed-off-by: James Wu <jjwu@meta.com >
2025-04-26 11:28:20 +08:00
Yihua Cheng
5e83a7277f
[v1] [P/D] Adding LMCache KV connector for v1 ( #16625 )
2025-04-26 03:03:38 +00:00
rasmith
68af5f6c5c
[AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary ( #17215 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-04-25 19:55:05 -07:00
Chen Zhang
8de2901fea
[Bugfix] gemma[2,3] interleaved attention when sliding window is disabled ( #17180 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-25 19:53:51 -07:00
Rui Qiao
c53e0730cb
[Misc] Refine ray_serve_deepseek example ( #17204 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-25 16:06:59 -07:00
Benjamin Chislett
a0e619e62a
[V1][Spec Decode] EAGLE-3 Support ( #16937 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Co-authored-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-25 15:43:07 -07:00
Nick Hill
70116459c3
[BugFix][Frontend] Fix LLM.chat() tokenization ( #16081 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-25 22:20:05 +00:00
Christian Heimes
65e262b93b
Fix Python packaging edge cases ( #17159 )
...
Signed-off-by: Christian Heimes <christian@python.org >
2025-04-26 06:15:07 +08:00
Cyrus Leung
43faa0461a
[Bugfix] Fix hybrid model tests ( #17182 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 15:14:37 -07:00
Daniel Li
48cb2109b6
[V1] Move usage stats to worker and start logging TPU hardware ( #16211 )
2025-04-25 14:06:01 -06:00
Russell Bryant
a5450f11c9
[Security] Use safe serialization and fix zmq setup for mooncake pipe ( #17192 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-25 16:53:23 +00:00
Cyrus Leung
9d98ab5ec6
[Misc] Inline Molmo requirements ( #17190 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 16:41:44 +00:00
Reid
df5c879527
[doc] update wrong hf model links ( #17184 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-25 16:40:54 +00:00
Harry Mellor
423e9f1cbe
Use Transformers helper get_text_config() instead of checking for text_config ( #17105 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-25 08:47:35 -07:00
Harry Mellor
0bd7f8fca5
Bump Transformers to 4.51.3 ( #17116 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-25 08:34:34 -07:00
Jasmond L
d5615af9ae
[Bugfix] Fix Mistral ChatCompletionRequest Body Exception ( #16769 )
...
Signed-off-by: Jasmond Loh <Jasmond.Loh@hotmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-25 07:26:30 -07:00
Cyrus Leung
19dcc02a72
[Bugfix] Fix mistral model tests ( #17181 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 06:03:34 -07:00
Alex Brooks
7feae92c1f
[Doc] Move todo out of beam search docstring ( #17183 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-25 04:44:58 -07:00
Michael Yao
f851b84266
[Doc] Add two links to disagg_prefill.md ( #17168 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-25 10:23:57 +00:00
Lu Fang
fc966e9cc6
Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 ( #17158 )
2025-04-25 17:10:32 +08:00
Michael Yao
ef19e67d2c
[Doc] Add headings to improve gptqmodel.md ( #17164 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-25 01:13:13 -07:00
rasmith
a41351f363
[Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization ( #15734 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-04-25 00:45:02 -07:00
Sangyeon Cho
6aae216b4e
[Bugfix] remove fallback in guided_json (int range, patterns) ( #16725 )
...
Signed-off-by: csy1204 <josang1204@gmail.com >
Co-authored-by: 조상연[플레이스 AI] <sang-yeon.cho@navercorp.com >
2025-04-25 06:54:43 +00:00
yexin(叶鑫)
b22980a1dc
[Perf]Optimize rotary_emb implementation to use Triton operator for improved inference performance ( #16457 )
...
Signed-off-by: cynthieye <yexin93@qq.com >
Co-authored-by: MagnetoWang <magnetowang@outlook.com >
2025-04-25 14:52:28 +08:00
Lucas Wilkinson
881f735827
[Misc] Benchmark Serving Script Support Appending Results ( #17028 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-24 22:53:55 -07:00
Mengqing Cao
2f54045508
[Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton ( #15099 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-04-24 22:51:02 -07:00
Lifu Huang
5aa6efb9a5
[Misc] Clean up redundant code in uniproc_executor.py ( #16762 )
...
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com >
2025-04-24 22:49:30 -07:00
Harry Mellor
6ca0234478
Move missed SchedulerConfig args into scheduler config group in EngineArgs ( #17131 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 22:48:53 -07:00
Michael Goin
649818995f
[Docs] Fix True->true in supported_models.md ( #17141 )
2025-04-25 04:20:04 +00:00
Varun Sundar Rabindranath
7a0a9da72b
[Doc] V1 : Update LoRA status ( #17133 )
...
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com >
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com >
2025-04-24 20:17:22 -07:00
Zaida Zhou
69bff9bc89
fix float16 support for kimi-vl ( #17156 )
...
Co-authored-by: zhouzaida <zhouzaida@msh.team >
2025-04-24 20:16:32 -07:00
Lucas Wilkinson
41ca7eb491
[Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 ( #16864 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-24 20:12:21 -07:00
vllmellm
eef364723c
[FEAT] [ROCm]: AITER Fused MOE V1 Support ( #16752 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-04-25 11:06:50 +08:00
jglaser
0d6e187e88
Use custom address for listening socket ( #15988 )
...
Signed-off-by: Jens Glaser <glaserj@ornl.gov >
2025-04-25 01:57:16 +00:00
Michael Goin
9420a1fc30
Better error message for missing mistral params.json ( #17132 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 23:43:08 +00:00
Rui Qiao
583e900996
[Misc] Add example to run DeepSeek with Ray Serve LLM ( #17134 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-24 22:25:21 +00:00
Maximilien de Bayser
05e1fbfc52
Add chat template for Llama 4 models ( #16428 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-04-24 20:19:36 +00:00
Yinghai Lu
fe92176321
Add collective_rpc to llm engine ( #16999 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-04-24 20:16:52 +00:00
Russell Bryant
6d0df0ebeb
[Docs] Generate correct github links for decorated functions ( #17125 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-24 10:39:43 -07:00
Harry Mellor
0fa939e2d1
Improve configs - LoRAConfig + PromptAdapterConfig ( #16980 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 10:29:34 -07:00
Harry Mellor
0422ce109f
Add :markdownhelp: to EngineArgs docs so markdown docstrings render properly ( #17124 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 10:28:45 -07:00
Eyshika Agarwal
47bdee409c
Molmo Requirements ( #17026 )
...
Signed-off-by: Eyshika Agarwal <eyshikaengineer@gmail.com >
Signed-off-by: eyshika <eyshikaengineer@gmail.com >
2025-04-24 10:08:37 -07:00
Atilla
49f189439d
existing torch installation pip command fix for docs ( #17059 )
2025-04-24 10:07:21 -07:00
Aaruni Aggarwal
5adf6f6b7f
Updating builkite job for IBM Power ( #17111 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-04-24 10:06:17 -07:00
Russell Bryant
4115f19958
[CI] Add automation for the tool-calling github label ( #17118 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-24 09:22:00 -07:00
Mark McLoughlin
340d7b1b21
[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics ( #16665 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-24 08:57:40 -07:00
Reid
1bcbcbf574
[Misc] refactor example series - structured outputs ( #17040 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-24 07:49:48 -07:00
Michael Goin
82e43b2d7e
Add missing rocm_skinny_gemms kernel test to CI ( #17060 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 07:49:37 -07:00
wang.yuqi
67309a1cb5
[Frontend] Using matryoshka_dimensions control the allowed output dimensions. ( #16970 )
2025-04-24 07:06:28 -07:00
Shanshan Shen
b724afe343
[V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning ( #16954 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-24 06:15:03 -07:00
Harry Mellor
21f4f1c9a4
Improve static type checking in LoRAModelRunnerMixin ( #17104 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 06:14:47 -07:00
Isotr0py
b0c1f6202d
[Misc] Remove OLMo2 config copy ( #17066 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-24 06:14:32 -07:00
Rui Qiao
c0dfd97519
[V1][PP] Optimization: continue scheduling prefill chunks ( #17080 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-24 05:27:08 -07:00
Harry Mellor
a9138e85b1
Fix OOT registration test ( #17099 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 04:44:12 -07:00
Harry Mellor
0a05ed57e6
Simplify TokenizerGroup ( #16790 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 04:43:56 -07:00
Michael Goin
14288d1332
Disable enforce_eager for V1 TPU sampler and structured output tests ( #17016 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 02:50:09 -07:00
Woosuk Kwon
b411418ff0
[Chore] Remove Sampler from Model Code ( #17084 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-24 02:49:33 -07:00
omer-dayan
2bc0f72ae5
Add docs for runai_streamer_sharded ( #17093 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-24 01:03:21 -07:00
Reid
9c1244de57
[doc] update to hyperlink ( #17096 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-24 00:58:08 -07:00
Reid
db2f8d915c
[V1] Update structured output ( #16812 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-23 23:57:17 -07:00
张宇
6167c0e5d2
[Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… ( #16472 )
...
Signed-off-by: 开哲 <kaizhe.zy@alibaba-inc.com >
Co-authored-by: 开哲 <kaizhe.zy@alibaba-inc.com >
2025-04-24 11:25:37 +08:00
Areeb Syed
ed2e464653
Addendum Fix to support FIPS enabled machines with MD5 hashing ( #17043 )
...
Signed-off-by: sydarb <areebsyed237@gmail.com >
2025-04-23 19:55:00 -07:00
Harry Mellor
2c8ed8ee48
More informative error when using Transformers backend ( #16988 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 19:54:03 -07:00
Michael Goin
ed50f46641
[Bugfix] Enable V1 usage stats ( #16986 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-23 19:54:00 -07:00
Woosuk Kwon
46e678bcff
[Minor] Use larger batch sizes for A100/B100/B200/MI300x ( #17073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-23 19:18:59 -07:00
Chen Xia
6b2427f995
[Quantization]add prefix for commandA quantized model ( #17017 )
2025-04-23 17:32:40 -07:00
Sangyeon Cho
b07d741661
[CI/Build] workaround for CI build failure ( #17070 )
...
Signed-off-by: csy1204 <josang1204@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-04-23 16:14:18 -07:00
Woosuk Kwon
41fb013d29
[V1][Spec Decode] Always use argmax for sampling draft tokens ( #16899 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-23 14:57:43 -07:00
Yong Hoon Shin
32d4b669d0
[BugFix][V1] Fix int32 token index overflow when preparing input ids ( #16806 )
2025-04-23 12:12:35 -07:00
Travis Johnson
3cde34a4a4
[Frontend] Support guidance:no-additional-properties for compatibility with xgrammar ( #15949 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-04-23 18:34:41 +00:00
Harry Mellor
bdb3660312
Use @property and private field for data_parallel_rank_local ( #17053 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 08:50:08 -07:00
Harry Mellor
f3a21e9c68
CacheConfig.block_size should always be int when used ( #17052 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 08:50:05 -07:00
Harry Mellor
8e630d680e
Improve Transformers backend model loading QoL ( #17039 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 07:33:51 -07:00
Russell Bryant
af869f6dff
[CI] Update structured-output label automation ( #17055 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-23 07:33:14 -07:00
Harry Mellor
53c0fa1e25
Ensure that pid passed to kill_process_tree is int for mypy ( #17051 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 07:32:26 -07:00
Michael Yao
f7912cba3d
[Doc] Add top anchor and a note to quantization/bitblas.md ( #17042 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-23 07:32:16 -07:00
Michael Goin
6317a5174a
Categorize tests/kernels/ based on kernel type ( #16799 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-23 09:21:07 -04:00
Michael Goin
aa72d9a4ea
Mistral-format support for compressed-tensors ( #16803 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-23 08:46:23 -04:00
Russell Bryant
ce17db8085
[CI] Run v1/test_serial_utils.py in CI ( #16996 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-23 01:13:34 -07:00
Chauncey
8c87a9ad46
[Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers ( #16964 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-23 07:24:09 +00:00
huafeng
ec69124eb4
[Misc] Improve readability of get_open_port function. ( #17024 )
...
Signed-off-by: gitover22 <qidizou88@gmail.com >
2025-04-23 06:16:53 +00:00
Lucas Wilkinson
d0da99fb70
[BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) ( #16998 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-22 21:49:24 -07:00
Nick Hill
b2f195c429
[V1] Avoid socket errors during shutdown when requests are in in-flight ( #16807 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-23 12:36:29 +08:00
vllmellm
047797ef90
[Bugfix] Triton FA function takes no keyword arguments ( #16902 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-22 21:35:24 -07:00
Reid
eb8ef4224d
[doc] add download path tips ( #17013 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-23 04:06:30 +00:00
Chendi.Xue
56a735261c
[INTEL-HPU][v0] Port delayed sampling to upstream ( #16949 )
...
Signed-off-by: Michal Adamczyk <michal.adamczyk@intel.com >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
2025-04-22 20:14:11 -07:00
youkaichao
e1cf90e099
[misc] tune some env vars for GB200 ( #16992 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-23 10:59:48 +08:00
Chauncey
6bc1e30ef9
Revert "[Misc] Add S3 environment variables for better support of MinIO." ( #17021 )
2025-04-22 19:22:29 -07:00
vllmellm
7e081ba7ca
[BugFix] Revert ROCm Custom Paged Attention Env Flag Check ( #17022 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-22 19:17:48 -07:00
Nick Hill
1e013fa388
[V1][DP] More robust DP/EP dummy request coordination ( #16277 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 19:12:15 -07:00
Aleksandr Malyshev
bc7c4d206b
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 ( #13305 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com >
Signed-off-by: maleksan85 <maleksan@amd.com >
Signed-off-by: <>
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com >
2025-04-22 19:11:56 -07:00
Yang Wang
f67e9e9f22
add Dockerfile build vllm against torch nightly ( #16936 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-04-22 19:08:27 -07:00
Guillaume Calmettes
36fe78769f
[Bugfix] validate urls object for multimodal content parts ( #16990 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-23 09:43:06 +08:00
Chenyaaang
83d933718c
[Core][V1][TPU] Enable structured decoding on TPU V1 ( #16499 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-22 18:05:23 -06:00
Nick Hill
5175b884f7
[BugFix] Remove default multiproc executor collective_rpc timeout ( #17000 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 23:27:14 +00:00
Alexei-V-Ivanov-AMD
5536b30a4c
Fencing Kernels Tests for enabling on AMD ( #16929 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-22 09:32:40 -07:00
Richard Zou
7f58fb9718
Add assertion for no objects while hashing hf_config ( #16930 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-22 09:32:22 -07:00
vllmellm
30bc3e0f66
[FEAT][ROCm]: Support AITER MLA ( #15893 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
2025-04-22 09:31:13 -07:00
Reid
f34410715f
[frontend] enhance tool_calls type check ( #16882 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-22 15:40:24 +00:00
Chauncey
68d4c33202
[Misc] Add S3 environment variables for better support of MinIO. ( #16977 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-22 14:27:36 +00:00
Zhengyuan Su (苏政渊)
f961d7f6ef
[BugFix] Pass in correct VLLM config in FlashInfer backend ( #13207 ) ( #16973 )
...
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn >
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn >
2025-04-22 06:44:10 -07:00
Harry Mellor
d059110498
Improve configs - SpeculativeConfig ( #16971 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-22 12:55:36 +00:00
Yang Fan
571e8dd65e
[Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni ( #16974 )
...
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com >
2025-04-22 12:23:17 +00:00
Reid
4b91c927f6
[Misc] refactor example series ( #16972 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-22 11:44:21 +00:00
vllmellm
0e237f0035
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER ( #15001 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-04-22 02:46:28 -07:00
Cyrus Leung
8f7bace7c3
[Doc] Improve documentation for multimodal CLI args ( #16960 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-22 08:35:35 +00:00
Nick Hill
e4d6144232
[BugFix] Fix incremental detokenization perf issue ( #16963 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 08:16:19 +00:00
Lei Wang
8d32dc603d
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS ( #6036 )
...
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com >
Co-authored-by: xinyuxiao <xinyuxiao2024@gmail.com >
2025-04-22 09:01:36 +01:00
Woosuk Kwon
c4ab9f3e71
[V1] Remove pre-allocation for KV cache ( #16941 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-22 00:52:18 -07:00
Flora Feng
2689d5c027
[Model] Use autoweightloader for mamba ( #16950 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-04-22 07:48:15 +00:00
Chauncey
acba33a0f1
[Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams ( #16767 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-04-22 06:02:20 +00:00
SnowCharm
a114bf20a3
[Perf] Optimize _update_states for GPU model runner ( #16910 )
...
Signed-off-by: snowcharm <snowcharmqq@gmail.com >
2025-04-22 14:01:54 +08:00
Michael Yao
3097ce3a32
[Doc] Update ai_accelerator/hpu-gaudi.inc.md ( #16956 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-22 05:33:27 +00:00
Cyrus Leung
d6da9322c8
[Bugfix] Fix f-string for Python 3.9-3.11 ( #16962 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-21 21:45:55 -07:00
omer-dayan
71ce44047f
Support S3 Sharded loading with RunAI Model Streamer ( #16317 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-21 21:21:49 -07:00
Charlie Fu
188b7f9b8c
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm ( #15830 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-04-21 20:46:22 -07:00
wangxiyuan
b9b4746950
[V1] Remove additional_config check ( #16710 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-04-21 20:45:27 -07:00
Varun Sundar Rabindranath
7b8a2ab76f
[Kernel] Add expert_map support to Cutlass FP8 MOE ( #16861 )
...
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com >
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com >
2025-04-21 20:44:32 -07:00
Jee Jee Li
c9acbf1141
[Misc] Remove the chunked prefill warning for LoRA ( #16925 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-21 20:44:24 -07:00
kliuae
5b794cae8d
[ROCm] Add aiter tkw1 kernel for Llama4 fp8 ( #16727 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-21 20:42:34 -07:00
Jeffrey Li
0e4254492f
[Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other ( #16863 )
...
Signed-off-by: Jeffrey Li <jeffrey.dot.li@gmail.com >
2025-04-22 11:40:19 +08:00
Woosuk Kwon
1311913f55
[BugFix][Spec Decode] No in-place update to draft probs ( #16952 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-21 19:54:19 -07:00
Cyrus Leung
29f395c97c
[Doc] Remove unnecessary V1 flag ( #16924 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-21 21:04:38 -04:00
Nicolò Lucchesi
fa3bba2a53
[TPU][V1] Enable Top-P ( #16843 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-22 00:46:07 +00:00
Michael Goin
986537f1c3
[V1] V1 FlashInfer Attention ( #16684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Aurick Qiao <qiao@aurick.net >
2025-04-22 00:38:41 +00:00
Nicolò Lucchesi
210207525e
[TPU][V1] Capture multimodal encoder during model compilation ( #15051 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Siyuan Liu <lsiyuan@google.com >
2025-04-21 18:36:59 -06:00
Michael Goin
71eda0bb76
Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml ( #16946 )
2025-04-21 18:35:32 -06:00
Chengji Yao
471fe65630
[TPU][V1] Implicitly adjust page size when there's SMEM OOM ( #16871 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-21 15:43:13 -06:00
Woosuk Kwon
3a0fba5cf4
[V1][Spec Decode] Handle draft tokens beyond max_model_len ( #16087 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-21 12:38:50 -07:00
Chanh Nguyen
299ebb62b2
[Core] Speed up decode by remove synchronizing operation in sampler ( #16436 )
...
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com >
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com >
2025-04-21 18:18:22 +00:00
David Xia
f728ab8e35
[Doc] mention how to install in CPU editable mode ( #16923 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-21 17:45:51 +00:00
David Xia
63e26fff78
[doc] install required python3-dev apt package ( #16888 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-21 16:15:18 +00:00
Yan Ma
fe3462c774
[XPU][Bugfix] minor fix for XPU ( #15591 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-04-22 00:02:57 +08:00
Kartik Ramesh
3b34fd5273
Raise error for data-parallel with benchmark_throughput ( #16737 )
...
Signed-off-by: Kartik Ramesh <kartikx2000@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-04-21 23:51:43 +08:00
Isotr0py
55d6d3fdb8
[Bugfix] Fix GLM rotary_dim issue and support v1 ( #16912 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
2025-04-21 14:26:34 +00:00
Shanshan Shen
7272bfae77
[Misc] Refactor platform to get device specific stream and event ( #14411 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-21 21:25:49 +08:00
wangxiyuan
d9ac9e3dc5
[Misc] fix collect_env version parse ( #15267 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-04-21 20:29:40 +08:00
Han Zhang
d41faaf9df
Restore buffers when wake up from level 2 sleep ( #16564 ) ( #16889 )
...
Signed-off-by: Han <zh950713@gmail.com >
2025-04-21 20:18:28 +08:00
Alex Brooks
b34f33438a
[Doc] Split dummy_processor_inputs() in Multimodal Docs ( #16915 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-21 11:10:01 +00:00
Yang Fan
26c0406555
[Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni ( #16907 )
2025-04-21 10:25:21 +00:00
Woosuk Kwon
4c41278b77
[CI/CD][V1] Add spec decode tests to CI ( #16900 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-20 22:37:16 -07:00
qizixi
bb3605db85
[Bugfix] Fix v1/spec_decode/test_ngram.py ( #16895 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-04-20 20:54:29 -07:00
Richard Zou
fe742aef5a
[easy] Pass compile_fx only the config patches ( #16845 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-20 12:25:19 +08:00
Harry Mellor
4b07d36891
Improve configs - CacheConfig ( #16835 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-20 12:25:04 +08:00
Staszek Paśko
87aaadef73
Serialize tensors using int8 views ( #16866 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-19 10:28:34 -07:00
Richard Zou
682e0b6d2f
Log how much time loading a compiled artifact takes ( #16848 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-19 16:50:46 +00:00
Reid
d6195a748b
[doc] update hyperlink ( #16877 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-19 16:40:38 +00:00
Cyrus Leung
205d84aaa9
[VLM] Clean up models ( #16873 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-19 12:13:06 +00:00
Roger Wang
5124f5bf51
[Model] Qwen2.5-Omni Cleanup ( #16872 )
2025-04-19 09:37:02 +00:00
Isotr0py
83f3c3bd91
[Model] Refactor Phi-4-multimodal to use merged processor and support V1 ( #15477 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-19 02:26:11 -07:00
vie-serendipity
d9737ca1c6
[V1][Misc] stop update prefix cache stats when logs_stats is disabled ( #16460 )
...
Signed-off-by: vie-serendipity <2733147505@qq.com >
2025-04-19 02:25:19 -07:00
Nicolò Lucchesi
9d4ca19d50
[Misc] Benchmarks for audio models ( #16505 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-19 02:24:14 -07:00
Nicolò Lucchesi
2ef0dc53b8
[Frontend] Add sampling params to v1/audio/transcriptions endpoint ( #16591 )
...
Signed-off-by: Jannis Schönleber <joennlae@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Jannis Schönleber <joennlae@gmail.com >
2025-04-19 07:03:54 +00:00
Divakar Verma
1d4680fad2
[rocm][MI300] llama4 maverick fp8 moe config tp8 ( #16847 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-04-19 06:21:43 +00:00
Yang Fan
2c1bd848a6
[Model][VLM] Add Qwen2.5-Omni model support (thinker only) ( #15130 )
...
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Xiong Wang <wangxiongts@163.com >
2025-04-18 23:14:36 -07:00
omrishiv
5c9121203c
[release] Publish neuron docker image ( #16733 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2025-04-18 17:11:25 -07:00
Justin Ho
490b1698a5
[Doc] Updated Llama section in tool calling docs to have llama 3.2 config info ( #16857 )
...
Signed-off-by: jmho <jaylenho734@gmail.com >
2025-04-18 23:28:53 +00:00
Reid
5a5e29de88
[Misc] refactor examples series - Chat Completion Client With Tools ( #16829 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-18 23:24:42 +00:00
wang.yuqi
3d3ab3689f
[New Model]: Snowflake Arctic Embed (Family) ( #16649 )
2025-04-18 08:11:57 -07:00
Harry Mellor
686623c5e7
Fix nullable_kvs fallback ( #16837 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-18 05:58:39 -07:00
Cyrus Leung
aadb656562
[Misc] Clean up Kimi-VL ( #16833 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-18 05:15:09 -07:00
Jonghyun Choe
87e067de41
[Model] use AutoWeightsLoader for BigCode, GPT-J ( #16823 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-18 10:42:41 +00:00
Michael Yao
26507f8973
[Docs] Fix a link and grammar issue in production-stack.md ( #16809 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-18 06:42:58 +00:00
Nathan Weinberg
9c1d5b456d
[Doc] add podman setup instructions for official image ( #16796 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
2025-04-18 06:10:49 +00:00
Lucia Fang
e31045f95c
[Bugfix] fix pp for llama4 ( #16746 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-18 13:51:30 +08:00
Luka Govedič
aaec845f8e
[ROCm] [Attention] Cleanup ROCm output passing ( #16431 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-04-18 05:46:45 +00:00
rongfu.leng
7bdfd29a35
[Misc] add collect_env to cli and docker image ( #16759 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-17 22:13:35 -07:00
Harry Mellor
e78587a64c
Improve-mm-and-pooler-and-decoding-configs ( #16789 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 22:13:32 -07:00
Lucas Wilkinson
7eb4255628
[BugFix] Accuracy fix for llama4 int4 - improperly casted scales ( #16801 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-17 22:13:29 -07:00
Michael Goin
6a0f547561
Add hardware print to TPU V1 test ( #16792 )
2025-04-17 22:13:26 -07:00
Shanshan Shen
30ed81b7ca
[V1][Structured Output] Minor modification to _validate_structured_output() ( #16748 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-18 13:12:54 +08:00
Chauncey
7a4a5de729
[Misc] Update outdated note: LMCache now supports chunked prefill ( #16697 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-18 05:12:42 +00:00
Cyrus Leung
c16fb5dae8
[Doc] Improve help examples for --compilation-config ( #16729 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 21:22:34 -07:00
Tarun Kumar
e37073efd7
Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema ( #16721 )
...
Signed-off-by: Tarun Kumar <takumar@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-17 21:08:27 -07:00
Lucas Wilkinson
183dad7a85
[Attention] Update to lastest FA3 code ( #13111 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-17 15:14:07 -07:00
Yihua Cheng
3408e47159
[P/D][V1] KV Connector API V1 ( #15960 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: remi <remi@mistral.ai >
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-04-17 13:22:40 -07:00
Nick Hill
0377b8310b
[MLA] Simplification to batch P/D reordering ( #16673 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-17 16:12:09 -04:00
Mark McLoughlin
e4755f7fac
[V1][Metrics] Fix http metrics middleware ( #15894 )
2025-04-17 19:52:18 +00:00
Sijia(Jackson) Chen
92edf35826
[ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints ( #16674 )
2025-04-17 11:44:34 -07:00
Nicolò Lucchesi
eb5819b2d9
[V1][TPU] Enable Top K ( #15489 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
Co-authored-by: Hyesoo Yang <hyeygit@gmail.com >
2025-04-17 18:18:11 +00:00
Nicolò Lucchesi
5989f4684d
[TPU][V1] Fix padding recompilation when max-num-batched-tokens is not even ( #16726 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-17 18:09:57 +00:00
rongfu.leng
5125d72f02
[Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small ( #16548 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-17 17:48:31 +00:00
Ximingwang-09
a018e555fd
[Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 ( #16753 )
...
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com >
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com >
2025-04-18 00:01:30 +08:00
Robin
6211b92273
[Bugfix]Fix index out of range error in api server log ( #16787 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-04-17 09:01:07 -07:00
Nick Hill
05fcd1b430
[V1][Perf] Faster incremental detokenization ( #15137 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-17 07:45:24 -07:00
Insu Kim
7c02d6a137
[Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion ( #16784 )
...
Signed-off-by: insukim1994 <insu.kim@moreh.io >
2025-04-17 14:10:08 +00:00
wang.yuqi
11c3b98491
[Doc] Document Matryoshka Representation Learning support ( #16770 )
2025-04-17 13:37:37 +00:00
Cyrus Leung
dbe7f07001
[Doc] Make sure to update vLLM when installing latest code ( #16781 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 06:53:31 -06:00
Reid
c69bf4ee06
fix: hyperlink ( #16778 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 11:34:20 +00:00
Harry Mellor
d27ea94034
Improve configs - TokenizerPoolConfig + DeviceConfig ( #16603 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 11:19:42 +00:00
Reid
99ed526101
[Misc] refactor examples series - lmcache ( #16758 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 11:02:35 +00:00
Michael Yao
207da28186
[Doc] Fix a 404 link in installation/cpu.md ( #16773 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-17 10:46:21 +00:00
intervitens
5b1aca2ae3
[Bugfix] Fix GLM4 model ( #16618 )
...
Signed-off-by: intervitens <intervitens@tutanota.com >
2025-04-17 03:35:07 -07:00
Reid
d8e557b5e5
[doc] add open-webui example ( #16747 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 18:27:32 +08:00
Cyrus Leung
61a44a0b22
[Doc] Add more tips to avoid OOM ( #16765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 09:54:34 +00:00
DefTruth
a6481525b8
[misc] ignore marlin_moe_wna16 local gen codes ( #16760 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-17 17:15:14 +08:00
Richard Liaw
8cac35ba43
[Ray] Improve documentation on batch inference ( #16609 )
...
Signed-off-by: Richard Liaw <rliaw@berkeley.edu >
2025-04-16 22:19:26 -07:00
Russell Bryant
9dbf7a2dc1
[V1] Remove log noise when idle ( #16735 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-16 21:34:08 -07:00
David Heineman
607029e515
[Bugfix] Revert max_prompt_len validation for decoder-only models. ( #16741 )
...
Signed-off-by: David Heineman <david@davidheineman.com >
2025-04-16 21:33:15 -07:00
Isotr0py
cb072ce93b
[Bugfix] Update Florence-2 tokenizer to make grounding tasks work ( #16734 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-17 04:17:39 +00:00
Divakar Verma
95aca283b4
[rocm][V0] fix selection logic for custom PA in V0 ( #16426 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-04-16 19:52:11 -07:00
Robert Shaw
2b05b8ce69
[V1][Frontend] Improve Shutdown And Logs ( #11737 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-16 19:48:34 -07:00
Aaruni Aggarwal
3c776dcefb
Adding vllm buildkite job for IBM Power ( #16679 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-04-17 10:47:47 +08:00
Bryan Lu
2cbd4d2999
[V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification ( #16636 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-16 19:47:26 -07:00
Staszek Paśko
3092375e27
[V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] ( #16432 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-16 19:28:32 -07:00
Harry Mellor
3cd91dc955
Help user create custom model for Transformers backend remote code models ( #16719 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 01:05:59 +00:00
Jade Zheng
8a7368e069
[Misc] Remove redundant comment ( #16703 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-04-17 00:44:52 +00:00
Harry Mellor
93e561ec4d
Improve error for structured output backend selection ( #16717 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 00:35:35 +00:00
Joe Runde
e1b004839a
[Hardware] Add processor inputs to platform validation ( #16680 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-04-16 09:28:42 -07:00
xsank
ee378f3d49
[Model] support modernbert ( #16648 )
...
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com >
Co-authored-by: 唯勤 <xsank.mz@alibaba-inc.com >
2025-04-16 05:30:15 -07:00
DefTruth
e82ee40de3
[Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel ( #16693 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-16 03:31:39 -07:00
Cyrus Leung
facbe2a114
[Doc] Improve OOM troubleshooting ( #16704 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-16 18:29:48 +08:00
Reid
7168920491
[Misc] refactor examples series ( #16708 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-16 10:16:36 +00:00
Kay Yan
21378a2323
[CI] Cleanup additional_dependencies: [toml] for pre-commit yapf hook ( #16405 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-16 10:05:31 +00:00
Shanshan Shen
976711d9db
[V1][Structured Output] Move xgrammar related utils to backend_xgrammar.py ( #16578 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-16 17:01:36 +08:00
Sage Moore
44fa4d556c
[ROCM] Bind triton version to 3.2 in requirements-built.txt ( #16664 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-04-16 14:05:28 +08:00
billishyahao
3ac98edcb1
[Feature] add model aware kv ops helper ( #16020 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
2025-04-15 23:00:43 -07:00
Richard Zou
966c742ed2
Disable remote caching when calling compile_fx ( #16611 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-15 22:18:28 -07:00
Jee Jee Li
0d7d05f4b6
[Misc] Modify LRUCache touch ( #16689 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-16 04:51:38 +00:00
rongfu.leng
96bb8aa68b
[Bugfix] fix gpu docker image mis benchmarks dir ( #16628 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-15 21:21:14 -07:00
Shinichi Hemmi
3badb0213b
[Model] Add PLaMo2 ( #14323 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
Signed-off-by: shemmi <shemmi@preferred.jp >
Co-authored-by: Kento Nozawa <nzw0301@preferred.jp >
Co-authored-by: Hiroaki Mikami <mhiroaki@preferred.jp >
Co-authored-by: Calvin Metzger <metzger@preferred.jp >
2025-04-15 19:31:30 -07:00
Angky William
fdcb850f14
[Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server ( #10546 )
...
Signed-off-by: Angky William <angkywilliam@Angkys-MacBook-Pro.local >
Co-authored-by: Angky William <angkywilliam@Angkys-MacBook-Pro.local >
2025-04-15 22:31:38 +00:00
Dipika Sikka
54a66e5fee
[Misc] Update compressed-tensors WNA16 to support zero-points ( #14211 )
2025-04-15 07:33:51 -06:00
DefTruth
280d62b8a2
[Kernel] Remove redundant Exp calculations ( #16123 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-15 12:58:37 +00:00
Xihui Cang
1666e66443
Add "/server_info" endpoint in api_server to retrieve the vllm_config. ( #16572 )
...
Signed-off-by: Xihui Cang <xihuicang@gmail.com >
2025-04-15 11:50:38 +00:00
Jee Jee Li
1575c1701a
[CI/Build] Fix LoRA OOM ( #16624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-15 16:38:19 +08:00
Reid
6ae996a873
[Misc] refactor argument parsing in examples ( #16635 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-15 08:05:30 +00:00
Richard Zou
b590adfdc1
Fix vLLM x torch.compile config caching ( #16491 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-14 23:11:11 -07:00
Michael Goin
b4fe16c75b
Add vllm bench [latency, throughput] CLI commands ( #16508 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-14 23:10:35 -07:00
Pooya Davoodi
bc5dd4f669
[Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) ( #16631 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-04-14 23:09:58 -07:00
Tyler Michael Smith
dbb036cf61
[Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py ( #16623 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-04-15 05:35:38 +00:00
Taneem Ibrahim
70e7ed841d
[BugFix]: Update minimum pyzmq version ( #16549 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-04-14 20:06:03 -07:00
Jinzhen Lin
d06ba4ed3f
[Kernel] moe wna16 marlin kernel ( #14447 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-14 20:05:22 -07:00
Alex Brooks
6b40996ae8
[Core][Bugfix] Fix Offline MM Beam Search ( #16390 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-15 10:33:02 +08:00
Shuqiao Li
d2020acac7
config check sleep mode support oot platforms ( #16562 )
2025-04-14 16:31:50 -07:00
Chengji Yao
1eb3c2ed48
[DOC][TPU] Add core idea about avoiding recompilation after warmup ( #16614 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-14 21:56:06 +00:00
Siyuan Liu
c64ee87267
[Hardware][TPU] Add torchvision to tpu dependency file ( #16616 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-04-14 17:50:46 -04:00
courage17340
b1308b84a3
[Model][VLM] Add Kimi-VL model support ( #16387 )
...
Signed-off-by: courage17340 <courage17340@163.com >
2025-04-14 21:41:48 +00:00
Nishan Acharya
7b5ecf79bd
s390x: Fix PyArrow build and add CPU test script for Buildkite CI ( #16036 )
...
Signed-off-by: Nishan Acharya <Nishan.Acharya@ibm.com >
2025-04-14 10:55:32 -07:00
Harry Mellor
9883a18859
Fix triton install condition on CPU ( #16600 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-14 17:06:01 +00:00
Nicolò Lucchesi
b3f2fddd17
[TPU][V1] Fix exponential padding when max-num-batched-tokens is not a power of 2 ( #16596 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-14 17:01:05 +00:00
Cyrus Leung
aa29841ede
[Bugfix] Multi-modal caches not acting like LRU caches ( #16593 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-14 09:24:16 -07:00
Md. Shafi Hussain
6bf27affb6
[fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet ( #16048 )
...
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-04-14 17:08:39 +01:00
shangmingc
1dd23386ec
[Misc] Update usage with mooncake lib for kv transfer ( #16523 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-14 11:31:37 +00:00
Reid
7cbfc10943
[Misc] refactor examples ( #16563 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-14 09:59:15 +00:00
DefTruth
ce4ddd2d1a
[Misc] remove warning if triton>=3.2.0 ( #16553 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-14 02:39:47 -07:00
Harry Mellor
e51929ebca
Improve configs - SchedulerConfig ( #16533 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-14 17:24:16 +08:00
Russell Bryant
dc1b4a6f13
[Core][V0] Enable regex support with xgrammar ( #13228 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-14 10:13:38 +08:00
Jennifer Zhao
63d2705edb
[Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py ( #16556 )
2025-04-13 17:20:26 -07:00
Michael Goin
d085a44082
Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) ( #16537 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-13 14:55:18 +00:00
Lily Liu
f49e5aff11
[V1][Spec Decode] KV cache slots for eagle heads ( #16370 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-12 19:42:51 -07:00
Ryan McConville
6c11ecf8d3
[Bugfix] Validate logit biases to prevent out of vocab ids crashing engine ( #16529 )
...
Signed-off-by: Ryan McConville <ryan@ryanmcconville.com >
2025-04-12 20:19:19 +00:00
SnowCharm
93e5f3c5fb
[Perf] Optimize Preparing Inputs for GPU Model Runner ( #16484 )
...
Signed-off-by: snowcharm <snowcharmqq@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-12 22:54:37 +08:00
Jie Fu (傅杰)
70363bccfa
Fix syntaxWarning: invalid escape sequence '\s' ( #16532 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2025-04-12 14:39:42 +00:00
Jee Jee Li
3cdc57669f
[Misc] Delete redundant code ( #16530 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-04-12 11:21:37 +00:00
Huazhong Ji
68bb122eb4
[MISC] Make GroupCoordinator compatible with out-of-tree devices ( #16464 )
...
Signed-off-by: hzji210@gmail.com <hzji210@gmail.com >
2025-04-12 09:20:25 +00:00
Cyrus Leung
d9fc8cd9da
[V1] Enable multi-input by default ( #15799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-12 08:52:39 +00:00
Nicolò Lucchesi
f069f3ea74
[Misc] Openai transcription client example use same Whisper model ( #16487 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-12 07:27:03 +00:00
Cyrus Leung
c5bc0e7fcc
[Misc] Update chat utils tests ( #16520 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-12 06:48:43 +00:00
Tianer Zhou
4a3a518722
fix: spelling ( #16466 )
...
Signed-off-by: Tianer Zhou <ezhoureal@gmail.com >
2025-04-11 23:24:22 -07:00
wang.yuqi
fbf722c6e6
[Frontend] support matryoshka representation / support embedding API dimensions ( #16331 )
2025-04-11 23:23:10 -07:00
leon-seidel
e92d7085bf
[Feature][V1] Add xgrammar to support minLength, maxLength with test ( #16516 )
...
Signed-off-by: Leon Seidel <leon.seidel@fau.de >
2025-04-11 23:22:07 -07:00
Michael Goin
bd6028d6b0
Optimized topk for topk=1 (Llama-4) ( #16512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-12 14:21:08 +08:00
Ye (Charlotte) Qi
802329dee9
[Doc] Update Llama4 Model Names in Supported Models ( #16509 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-12 02:53:10 +00:00
Nick Hill
41cc883c29
[BugFix] Handle non-contiguous tensors properly when serializing ( #16492 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-11 17:54:06 -07:00
Michael Goin
57504a4bcf
[CI][Bugfix] Add mistral_tool_use to Ci ( #16517 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 17:52:38 -07:00
Yuan Tang
ed4792c990
[Doc] Fix link to vLLM blog ( #16519 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-04-11 17:39:23 -07:00
Michael Goin
87b836ba77
Bugfix for PixtralHF models without spatial_merge_size ( #16513 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 23:32:22 +00:00
rongfu.leng
56c76c2e0e
[Bugfix] clean up duplicated code ( #16485 )
...
Signed-off-by: Gogs <gogs@fake.local >
Co-authored-by: Gogs <gogs@fake.local >
2025-04-11 23:19:40 +00:00
Christian Sears
c09632a66c
Update openai_compatible_server.md ( #16507 )
...
Signed-off-by: Christian Sears <csears@redhat.com >
2025-04-11 22:54:58 +00:00
Yong Hoon Shin
a3bf8d4a2b
[Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 ( #16488 )
2025-04-12 06:26:55 +08:00
Ye (Charlotte) Qi
16eda8c43a
[Frontend] Added chat templates for LLaMa4 pythonic tool calling ( #16463 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Kai Wu <kaiwu@meta.com >
2025-04-12 06:26:17 +08:00
Harry Mellor
cd77382ac1
Improve configs - LoadConfig ( #16422 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-11 20:27:27 +00:00
Travis Johnson
71b9cde010
[Bugfix] handle alignment of encoder_seq_lens in mllama.py ( #14784 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-04-11 19:59:50 +00:00
Isotr0py
5285589f37
[Doc] Document InternVL3 support ( #16495 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-11 19:41:09 +00:00
Michael Goin
f41647ee6b
[Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel ( #16366 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 17:54:08 +00:00
Nicolò Lucchesi
4d022cbc75
[TPU][V1] Make --disable_chunked_mm_input mandatory for serving MM models ( #16483 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-11 17:06:14 +00:00
Richard Zou
70de35a881
Fix erroneous "model doesn't support compile" warning ( #16486 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-11 16:24:36 +00:00
Tomasz Zielinski
34b2cf3b33
[Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU ( #12779 )
...
Signed-off-by: Tomasz Zielinski <tomasz.zielinski@intel.com >
2025-04-11 07:38:36 -07:00
chaow-amd
9e90c9f73f
[Bugfix] Fix bugs of running Quark quantized models ( #16236 )
...
Signed-off-by: chaow <chaow@amd.com >
2025-04-11 10:18:32 -04:00
DefTruth
e9528f6dc6
[Kernel] support merge_attn_states CUDA kernel, 3x speedup ( #16173 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-11 06:50:50 -06:00
Harry Mellor
51baa9c333
Don't install triton on ppc64le platform ( #16470 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-11 10:11:00 +00:00
Reid
35e076b3a8
[Misc] update api_client example ( #16459 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-11 10:05:40 +00:00
Jee Jee Li
a26f59ccbc
[Misc] Raise error for V1 not supporting Long LoRA. ( #16415 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-11 01:51:20 -07:00
Michael Goin
aa3b3d76e0
Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True ( #16447 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 08:09:52 +00:00
Jee Jee Li
f7030df3be
[Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner ( #15990 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-11 15:32:37 +08:00
DefTruth
905e91e9ac
Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" ( #16453 )
2025-04-11 06:44:22 +00:00
Alex Brooks
f8f9c0ba62
[Bugfix] Don't set an upper bound on repetition penalty ( #16403 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-11 14:19:40 +08:00
Li, Jiang
dda811021a
[CPU][Bugfix] Fix CPU docker issues ( #16454 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-04-11 14:19:07 +08:00
Isotr0py
93195146ea
[Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test ( #16424 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-11 04:57:16 +00:00
Michael Goin
ed37599544
Update supported_hardware.md for TPU INT8 ( #16437 )
2025-04-11 12:28:07 +08:00
Yong Hoon Shin
99ef59cf7f
[Llama4] Enable attention temperature tuning by default for long context (>32k) ( #16439 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-10 21:26:07 -07:00
Chenyaaang
d544d141ec
update benchmark_serving_structured_output to include auto backend ( #16438 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-11 12:25:52 +08:00
Alexey Belyakov
3e397a9484
check input length of sonnet samples ( #16423 )
...
Signed-off-by: alexey-belyakov <alexey.belyakov@intel.com >
2025-04-11 10:15:06 +08:00
WWW
268c325078
Fix range_ratio Bug in RandomDataset ( #16126 )
...
Signed-off-by: jadewang21 <jadewangcn@outlook.com >
2025-04-10 15:31:17 -07:00
Nicolò Lucchesi
3cc9af88ff
[TPU][V1] Disable per-request seed/Generator ( #16172 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-10 17:05:44 -04:00
look
7cd0bd7212
[Bugfix] Fix output token length check logic ( #16419 )
...
Signed-off-by: look <eeslook@163.com >
2025-04-10 20:16:48 +00:00
Cyrus Leung
56d4aefa33
[VLM] Avoid unnecessary dummy multimodal data during processing ( #16416 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 19:32:14 +00:00
Nick Hill
dd143ef541
[V1] Zero-copy tensor/ndarray serialization/transmission ( #13790 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-10 19:23:14 +00:00
Chih-Chieh Yang
daefed052c
[Model] Reduce redundant computations in mamba2 blocks for Bamba-9B ( #15423 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-04-10 19:07:07 +00:00
Chenyaaang
5fbab20e02
[Bugfix] Fix bug when dataset is json ( #15899 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-10 18:35:41 +00:00
Lily Liu
e8224f3dca
[V1][Spec Decode] Eagle Model loading ( #16035 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-10 11:21:48 -07:00
Russell Bryant
9665313c39
[V1] Set structured output backend to auto by default ( #15724 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-10 17:53:26 +00:00
Harry Mellor
0c54fc7273
Improve configs - ParallelConfig ( #16332 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-10 17:34:37 +00:00
Nicolò Lucchesi
c1b57855ec
[TPU][V1] Use language_model interface for getting text backbone in MM ( #16410 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-10 17:32:04 +00:00
Cyrus Leung
83b824c8b4
[VLM] Remove BaseProcessingInfo.get_mm_max_tokens_per_item ( #16408 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 09:06:58 -07:00
Lu Fang
7678fcd5b6
Fix the torch version parsing logic ( #15857 )
2025-04-10 07:37:47 -07:00
wineandchord
8661c0241d
[CI] Add auto update workflow for Dockerfile graph ( #11879 )
...
Signed-off-by: wineandchord <guoqizhou19@gmail.com >
2025-04-10 13:43:05 +00:00
Reid
ce8d6b75fc
[doc] update the wrong link ( #16401 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-10 21:02:37 +08:00
Ye (Charlotte) Qi
61de3ef74b
[Model] Remove image mm limit for LLaMa4 ( #16365 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-10 09:36:27 +00:00
cyyever
ec1f9c8c91
Update Numba to 0.61.2 ( #16376 )
...
Signed-off-by: cyy <cyyever@outlook.com >
2025-04-10 07:59:37 +00:00
Reid
65e09094c4
[doc] add download model tips ( #16389 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-10 07:45:26 +00:00
Michael Goin
c70cf0fe06
[Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models ( #16038 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-10 15:08:47 +08:00
Cyrus Leung
a5d11a54dc
[Bugfix] Fix validation error for text-only Mllama 3.2 ( #16377 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 14:19:42 +08:00
Cyrus Leung
3d4c87758e
[Misc] Update transformers version limits of multi-modal tests ( #16381 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-09 23:03:33 -07:00
Aaron Ang
a9bd832fc5
[Model] use AutoWeightsLoader for deepseek_v2, internlm2 ( #16383 )
...
Signed-off-by: Aaron Ang <aaron.angyd@gmail.com >
2025-04-09 23:01:00 -07:00
Chenyaaang
417bcefbae
fix sonnet dataset sample when prefix len is very small ( #16379 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-10 05:35:07 +00:00
Michael Goin
baada0e737
[Bugfix][TPU] Fix TPU validate_request ( #16369 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-04-10 12:55:12 +08:00
Benjamin Kitor
82eb61dd4c
[misc] use tqdm.auto where appropriate ( #16290 )
...
Signed-off-by: Benjamin Kitor <bkitor@gigaio.com >
2025-04-09 21:54:54 -07:00
Roger Wang
0d4d06fe2f
[CI][Bugfix] Pin triton version for CPU ( #16384 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-10 04:35:00 +00:00
Jintao
4aed0ca6a2
[bugfix] Avoid the time consumption caused by creating dummy videos. ( #16371 )
2025-04-10 04:30:05 +00:00
Chengji Yao
1621b25288
[TPU] Fix dummy loading OOM ( #16372 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-10 04:06:16 +00:00
Aaron Ang
a564797151
[Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral ( #16325 )
...
Signed-off-by: Aaron Ang <aaron.angyd@gmail.com >
2025-04-09 20:07:40 -07:00
Guillaume Calmettes
1da6a09274
[Bugfix]: do not shutdown server if skip_special_use=False for MistralTokenizer ( #14094 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 19:43:09 -07:00
Yuxuan Zhang
1e44ffc3ff
Add GLM-4-0414 support ( #16338 )
...
Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Ajay Vohra <ajayvohr@amazon.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Co-authored-by: Accelerator1996 <lvfei.lv@alibaba-inc.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: yihong <zouzou0208@gmail.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: ajayvohra2005 <ajayvohr@amazon.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-10 09:19:42 +08:00
Chengji Yao
a454748544
[TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues ( #16275 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-09 18:51:51 -06:00
Reid
1bff42c4b7
[Misc] refactor Structured Outputs example ( #16322 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-09 23:32:42 +00:00
Joe Runde
cb391d85dc
[Hardware] add platform-specific request validation api ( #16291 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-04-09 12:50:01 -07:00
Russell Bryant
fee5b8d37f
[Build/CI] Add tracing deps to vllm container image ( #15224 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-09 19:14:06 +00:00
Michael Goin
b2ce859bd2
Fix benchmark_throughput.py --backend=hf ( #16352 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-09 19:09:28 +00:00
Chendi.Xue
566f10a929
[CI]Fix hpu docker and numpy version for CI ( #16355 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-04-09 17:52:26 +00:00
Guillaume Calmettes
c3b5189137
[Bugfix] catch AssertionError in MistralTokenizer as ValueError ( #16344 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 17:33:24 +00:00
zh Wang
a25866ac8d
[Bugfix] Fix profiling.py ( #16202 )
...
Signed-off-by: zh Wang <rekind133@outlook.com >
2025-04-09 17:03:34 +00:00
Michael Goin
098900d7c2
Revert "Update label-tpu mergify and remove removal bot" ( #16350 )
2025-04-09 07:59:36 -07:00
Guillaume Calmettes
98d01d3ce2
[Bugfix][Frontend] respect provided default guided decoding backend ( #15476 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 05:11:10 -07:00
Nicolò Lucchesi
d55244df31
[Model] Add SupportsMultiModal.get_language_model interface ( #16007 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-09 04:12:54 -07:00
yihong
04149cce27
[BugFix] fix some typos found by typos. ( #16314 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-09 03:43:59 -07:00
ajayvohra2005
24834f4894
update neuron config ( #16289 )
...
Signed-off-by: Ajay Vohra <ajayvohr@amazon.com >
2025-04-09 03:43:22 -07:00
Lucia Fang
ec7da6fcf3
[BugFix] llama4 qknorm should be not shared across head ( #16311 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-09 00:59:14 -07:00
yihong
819d548e8a
[BugFix] logger is not callable ( #16312 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-09 00:59:02 -07:00
Michael Goin
477d2a8aa2
Update label-tpu mergify and remove removal bot ( #16298 )
2025-04-09 07:56:25 +00:00
Cyrus Leung
e484e02857
[Bugfix] Avoid transferring cached multi-modal items from P0 to P1 ( #16273 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-09 00:51:27 -07:00
Accelerator1996
24f6b9a713
[Misc] Fix test_sharded_state_loader.py( #16004 ) ( #16005 )
...
Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com >
2025-04-09 14:47:30 +08:00
Luka Govedič
9cdde47289
[BugFix] Fix fusion test and add them to CI ( #16287 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-04-08 23:46:45 -07:00
Chengji Yao
b1eb4ca152
[TPU] Update PyTorch/XLA ( #16288 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-09 14:46:32 +08:00
Michael Goin
87b4ac56c2
[CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding ( #16221 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-09 04:14:46 +00:00
Russell Bryant
cb84e45ac7
[Core] Upgrade to xgrammar 0.1.18, add cache size limit ( #16283 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-08 19:13:22 -07:00
rongfu.leng
4716377fbc
[Feature] Estimate max-model-len use available KV cache memory ( #16168 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 19:12:51 -07:00
rongfu.leng
4e9cf8c1dd
[Bugfix] fix gettid method is not define ( #16084 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 19:12:44 -07:00
TJian
2976dc27e9
[Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs ( #16198 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com >
2025-04-08 19:12:34 -07:00
Chauncey
102bf967f0
[Model] Add smolvlm support ( #16017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-08 19:12:17 -07:00
yueshen2016
1f4b09b525
Add support to modelopt quantization of Mixtral model ( #15961 )
...
Signed-off-by: Yue <yueshen@nvidia.com >
2025-04-09 01:53:31 +00:00
Jee Jee Li
86c3369eb8
[CI/Build] Fix CI LoRA failure ( #16270 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-09 09:13:56 +08:00
Russell Bryant
2755c34a8f
[V1] Update structured output offline inference example ( #15721 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-08 22:34:09 +00:00
Jinzhen Lin
db10422184
[Bugfix] fix deepseek fp16 scale bug ( #14809 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-08 16:56:09 -04:00
Lucas Wilkinson
e1a2c699dd
[BugFix] Fix Llama4 - Index Error When Single Request Near Max Context ( #16209 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-08 18:56:51 +00:00
Harry Mellor
0115ccd5c0
Add warning that content below line in template will be removed ( #16276 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-08 18:18:40 +00:00
Isotr0py
40b4284fe3
[Bugfix] Handle process_weights_after_loading for QKVCrossParallelLinear ( #15328 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-08 10:02:23 -07:00
Cyrus Leung
4ebc0b9640
[Bugfix] Proper input validation for multi-modal encoder-decoder models ( #16156 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-08 09:45:21 -07:00
Kero Liang
dc96fd54c6
[Misc] Avoid stripping meaningful whitespace from nvidia-smi topo -m output in collect_env.py ( #16272 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-04-08 16:08:09 +00:00
wang.yuqi
1f5d13ab9f
[New Model]: jinaai/jina-embeddings-v3 ( #16120 )
2025-04-08 08:39:12 -07:00
Harry Mellor
90cb44eb02
Update to transformers==4.51.1 ( #16257 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-08 06:53:39 -07:00
Kebe
e11880deea
[Bugfix] Remove triton do_bench fast_flush arg ( #16256 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-04-08 13:51:06 +00:00
TY-AMD
9351f91be9
[BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm ( #16247 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-04-08 05:10:26 -07:00
rongfu.leng
5a1e1c8353
[Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe ( #16203 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 04:05:47 -07:00
Alex Brooks
69ecaa7c79
[Misc] Add warning for multimodal data in LLM.beam_search ( #16241 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-08 04:05:27 -07:00
Reid
7f00899ff7
[Misc] format and refactor some examples ( #16252 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-08 10:42:32 +00:00
Simon Mo
995e3d1f41
[Docs] Add Slides from Singapore Meetup ( #16213 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-08 07:20:22 +00:00
Kebe
b4ac449a83
[Misc] Merge the logs of pp layers partitions ( #16225 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-04-08 00:18:15 -07:00
Michael Goin
8e5314a468
[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill ( #15837 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-07 23:24:07 -07:00
Siyuan Liu
87918e40c4
[torch.compile][TPU] Make @support_torch_compile work for XLA backend ( #15782 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-08 14:23:53 +08:00
Isotr0py
f6b32efb7f
[Bugfix] Fix and reorganize broken GGUF tests and bump gguf version ( #16194 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-08 13:38:13 +08:00
Michael Goin
b99733d092
[Bugfix] Do not skip "empty" parts of chats that are parsable ( #16219 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-08 05:14:15 +00:00
Yong Hoon Shin
05a015d6a5
Add warning for Attention backends that do not support irope yet ( #16212 )
2025-04-08 03:59:26 +00:00
zxfan-cpu
ad971af8c7
[Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 ( #16161 )
2025-04-07 20:48:47 -07:00
Roger Wang
f2ebb6f541
[V1] Scatter and gather placeholders in the model runner ( #16076 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
2025-04-08 10:43:41 +08:00
Satyajith Chilappagari
1d01211264
Update BASE_IMAGE to 2.22 release of Neuron ( #16218 )
2025-04-07 19:11:18 -07:00
Miles Williams
f94ab12f79
[Misc] Update compressed-tensors to version 0.9.3 ( #16196 )
...
Signed-off-by: Miles Williams <42222518+mlsw@users.noreply.github.com >
2025-04-07 19:09:06 -07:00
youkaichao
a865bc1ca6
[core] do not send error across process ( #16174 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-07 19:09:03 -07:00
Michael Goin
21802c4b6d
[ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping ( #16031 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-04-07 21:28:14 -04:00
Driss Guessous
652907b354
Torchao ( #14231 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-04-07 19:39:28 -04:00
leon-seidel
24f1c01e0f
[Bugfix][V0] XGrammar structured output supports Enum ( #15878 )
...
Signed-off-by: Leon Seidel <leon.seidel@fau.de >
2025-04-07 22:38:25 +00:00
Reid
fad6e2538e
[Misc] add description attribute in CLI ( #15921 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-07 22:30:35 +00:00
Nick Hill
7f6d47c1a2
[V1][BugFix] Exit properly if engine core fails during startup ( #16137 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-07 15:30:15 -07:00
Benjamin Chislett
3147586ebd
[Bugfix] Fix guidance backend for Qwen models ( #16210 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-04-07 22:15:43 +00:00
Roger Wang
ed636d99ca
[Misc] Move Llama 4 projector call into encoder execution ( #16201 )
2025-04-07 14:02:05 -07:00
Nicolò Lucchesi
090c856d76
[Misc] Human-readable max-model-len cli arg ( #16181 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-04-07 14:40:58 -04:00
Gregory Shtrasberg
ad434d4cfe
Print the warning only once ( #16193 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-07 18:30:06 +00:00
Cyrus Leung
66d433b94f
[V1] Revert the default max_num_seqs to V0 values for most hardware ( #16158 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 13:54:36 -04:00
Cyrus Leung
027b204ff1
[Bugfix] Re-enable support for ChatGLMForConditionalGeneration ( #16187 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 23:15:58 +08:00
Lu Fang
55dcce91df
Upstream Llama4 Support to Main ( #16113 )
...
Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com >
Signed-off-by: Chris Thi <chris.c.thi@gmail.com >
Signed-off-by: drisspg <drisspguessous@gmail.com >
Signed-off-by: Jon Swenson <jmswen@gmail.com >
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
Signed-off-by: Lu Fang <fanglu@meta.com >
Signed-off-by: Xiaodong Wang <xdwang@meta.com >
Signed-off-by: Yang Chen <yangche@fb.com >
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Lu Fang <lufang@fb.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Lucia Fang <fanglu@fb.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 08:06:27 -07:00
Robin
8017c8db7f
[Doc]Update image to latest version ( #16186 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-04-07 14:17:39 +00:00
Reid
dc3529dbf6
[Misc] improve example mlpspeculator and llm_engine_example ( #16175 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-07 11:53:52 +00:00
YamPengLi
7699258ef0
[Model] Add Qwen3 and Qwen3MoE ( #15289 )
...
Signed-off-by: YamPengLi <yampayne.lyp@alibaba-inc.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-07 04:06:41 -07:00
Shanshan Shen
e9ba99f296
[V1][Structured Output] Add supports_structured_output() method to Platform ( #16148 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-07 11:06:24 +00:00
Isotr0py
7c80368710
[VLM] Florence-2 supports online serving ( #16164 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-07 04:04:02 -07:00
yihong
95d63f38c0
doc: fix some typos in doc ( #16154 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-07 05:32:06 +00:00
Roger Wang
bb8dab821e
[CI] Set max transformers version for Ultravox model test ( #16149 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-07 04:37:58 +00:00
Isotr0py
fc0f87768a
[Bugfix] Make dummy encoder prompt padding alternative and add missing warnings ( #16129 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-07 04:07:15 +00:00
Cyrus Leung
0a57386721
[Misc] Update Mistral-3.1 example ( #16147 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 03:57:37 +00:00
Woosuk Kwon
3749e28774
[V1][Minor] Minor simplification for get_computed_blocks ( #16139 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-06 20:38:12 -07:00
Kay Yan
86fc2321ff
[Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token ( #15202 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-06 20:34:51 -07:00
Martin Hoyer
2549c0dfef
Fix requires-python ( #16132 )
2025-04-06 19:22:25 -07:00
Woosuk Kwon
b10e519895
[V1][Minor] Optimize get_cached_block ( #16135 )
2025-04-06 20:48:14 +00:00
Chengji Yao
9bde5ba127
[TPU] Update PyTorch/XLA ( #16130 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-06 18:25:55 +00:00
Reid
72c8f1ad04
[Misc] update requires-python in pyproject.toml ( #16116 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-06 14:56:34 +00:00
paolovic
da224daaa9
[Bugfix] add hf_token to EngineArgs ( #16093 )
...
Signed-off-by: paolovic <paul-philipp.luley@uzh.ch >
Co-authored-by: paolovic <paul-philipp.luley@uzh.ch >
2025-04-06 14:47:33 +00:00
Varun Sundar Rabindranath
3a100b9278
[Bugfix] LoRA : Fix the order in which the kernels process LoRAs ( #16040 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-04-06 14:04:50 +00:00
rongfu.leng
242a637aea
[Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 ( #16103 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-06 05:52:01 -07:00
Isotr0py
c2a9671510
[Misc] Improve model redirect to accept json dictionary ( #16119 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-06 05:51:45 -07:00
Paul Schweigert
d5ae4f7f42
[Doc][Bugfix] Add missing EOF in k8s deploy doc ( #16025 )
2025-04-06 12:10:57 +00:00
Reid
b6c502a150
[Misc] refactor example eagle ( #16100 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-06 09:42:48 +00:00
Roger Wang
9ca710e525
[CI][V1] Fix passing tokenizer as kwarg to validate_guidance_grammar ( #16117 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-06 16:18:00 +08:00
Ben Jackson
eb07c8cb5b
[Frontend] Fix typo in tool chat templates for llama3.2 and toolace ( #14501 )
...
Signed-off-by: Ben Jackson <ben@ben.com >
2025-04-06 07:44:36 +00:00
Hyesoo Yang
ba10801961
[Benchmark] Add sampling parameters to benchmark_serving. ( #16022 )
...
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
2025-04-06 12:30:35 +08:00
Lucia Fang
620fc2d09e
[Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 ( #16112 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-05 21:23:40 -07:00
Jonghyun Choe
29283eaa7e
[Model] use AutoWeightsLoader for phi, gemma, deepseek ( #16088 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-05 20:34:38 -07:00
Jinzhen Lin
2fa66ef713
[Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine ( #15946 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-04-05 20:04:22 -07:00
Chauncey
13affc432d
[Misc] Remove redundant code ( #16098 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-05 20:03:50 -07:00
Reid
d8f094a92a
[Misc] format output for encoder_decoder.py ( #16095 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-05 19:57:18 -07:00
Harry Mellor
97ae6d777f
Fix some capitalisations in generated examples doc titles ( #16094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-05 13:44:03 +00:00
yihong
6baeee70d1
Revert "doc: add info for macos clang errors ( #16049 )" ( #16091 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-05 11:51:51 +00:00
Reid
d2517a4939
[doc] fix 404 ( #16082 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-05 11:39:18 +00:00
yihong
6342adc438
fix: support clang17 for macos and fix the real libomp ( #16086 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-05 11:00:12 +00:00
Kevin H. Luu
0adba91547
[CI] Fix benchmark script level ( #16089 )
2025-04-05 03:36:01 -07:00
Tristan Leclercq
4285e423a6
[Misc] Auto detect bitsandbytes pre-quantized models ( #16027 )
...
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com >
2025-04-04 23:30:45 -07:00
Woosuk Kwon
63375f0cdb
[V1][Spec Decode] Update N-gram Proposer Interface ( #15750 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-04 16:32:54 -07:00
Michael Goin
70ad3f9e98
[Bugfix][TPU] Fix V1 TPU worker for sliding window ( #16059 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-04-04 23:31:19 +00:00
bnellnm
d6fc629f4d
[Kernel][Minor] Re-fuse triton moe weight application ( #16071 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-04 23:27:34 +00:00
Roger Wang
af51d80fa1
Revert "[V1] Scatter and gather placeholders in the model runner" ( #16075 )
2025-04-04 14:50:57 -07:00
Cyrus Leung
f5722a5052
[V1] Scatter and gather placeholders in the model runner ( #15712 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-04-04 21:26:44 +00:00
Nick Hill
651cf0fec1
[V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue ( #15906 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-04 12:56:43 -07:00
Kevin H. Luu
4dc52e1c53
[CI] Reorganize .buildkite directory ( #16001 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-04-04 12:16:20 -07:00
Michael Goin
4708f13a9c
[Bugfix] Fix default behavior/fallback for pp in v1 ( #16057 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-04 17:58:08 +00:00
Gregory Shtrasberg
a6d042df0a
[ROCm][Bugfix] Bring back fallback to eager mode removed in #14917 , but for ROCm only ( #15413 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-04 09:40:37 -07:00
Gregory Shtrasberg
40a36ccfeb
[ROCm][Bugfix] Use platform specific FP8 dtype ( #15717 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-04 09:40:20 -07:00
Ilya Markov
ef608c37a7
[Distributed] [ROCM] Fix custom allreduce enable checks ( #16010 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-04-04 09:39:08 -07:00
Li, Jiang
2386803f2a
[CPU] Change default block_size for CPU backend ( #16002 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-04-04 09:39:05 -07:00
Ziji Shi (Steven)
95862f7b4d
[Benchmark][Doc] Update throughput benchmark and README ( #15998 )
...
Signed-off-by: StevenShi-23 <shi.ziji.sm@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-04-04 09:39:02 -07:00
Isotr0py
230b131b54
[Bugfix][kernels] Fix half2float conversion in gguf kernels ( #15995 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-04 09:38:58 -07:00
liuzhenwei
0812d8dd41
[Hardware][Gaudi][BugFix] fix arguments of hpu fused moe ( #15945 )
...
Signed-off-by: zhenwei <zhenweiliu@habana.ai >
2025-04-04 09:38:55 -07:00
Jonghyun Choe
bf7e3c51ae
[Model] use AutoWeightsLoader for baichuan, gpt-neox, mpt ( #15939 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-04 09:38:52 -07:00
Mark McLoughlin
a35a8a8392
[V1][Spec Decode] Avoid logging useless nan metrics ( #16023 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-04 08:52:41 -07:00
yihong
4ef0bb1fcf
doc: add info for macos clang errors ( #16049 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-04 14:58:16 +00:00
Chengji Yao
fadc59c0e6
[TPU][V1] Remove ragged attention kernel parameter hard coding ( #16041 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-04 07:48:50 -04:00
Reid
86cbd2eee9
[Misc] improve gguf check ( #15974 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-04 01:33:36 +00:00
Huy Do
092475f738
[ROCm] Tweak the benchmark script to run on ROCm ( #14252 )
2025-04-03 17:12:48 -07:00
bnellnm
dcc56d62da
[Bugfix] Fix function names in test_block_fp8.py ( #16033 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-03 23:01:34 +00:00
Robert Shaw
f15e70d906
[TPU] Switch Test to Non-Sliding Window ( #15981 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-04-03 14:28:45 -07:00
iefgnoix
b6be6f8d1e
[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. ( #15732 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-04-03 14:23:28 -07:00
Alexei-V-Ivanov-AMD
03a70eacaf
Re-enable the AMD Testing for the passing tests. ( #15586 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-03 11:05:17 -07:00
yarongmu-google
45b1ff7a25
[Misc][Performance] Advance tpu.txt to the most recent nightly torch … ( #16024 )
2025-04-03 17:32:54 +00:00
bnellnm
15ba07ef25
[Minor] Fused experts refactor ( #15914 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-03 10:19:38 -07:00
Liangfu Chen
d2b58ca203
[Neuron][kernel] Fuse kv cache into a single tensor ( #15911 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-04-03 09:51:32 -07:00
Kyle Sayers
82e7e19a6e
[SupportsQuant] Chameleon, Chatglm, Commandr ( #15952 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-04-03 08:25:22 -07:00
Kyle Sayers
421c462948
[SupportsQuant] Bert, Blip, Blip2, Bloom ( #15573 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-04-03 08:23:19 -07:00
yihong
84884cd9ac
fix: tiny fix make format.sh excutable ( #16015 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-03 15:18:05 +00:00
Reid
a43aa183dc
[doc] update contribution link ( #15922 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-03 10:47:31 +00:00
wwl2755
463bbb1835
[Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process ( #15367 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-04-03 07:32:10 +00:00
youkaichao
5e125e74d1
[misc] improve error message for "Failed to infer device type" ( #15994 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-03 14:45:03 +08:00
Ziji Shi (Steven)
06f21ce7a5
[Benchmark] Add AIMO Dataset to Benchmark ( #15955 )
...
Signed-off-by: Ziji Shi <shi.ziji.sm@gmail.com >
Signed-off-by: StevenShi-23 <shi.ziji.sm@gmail.com >
2025-04-03 06:09:18 +00:00
Aleksandr Malyshev
57a810db9c
[ROCM][V0] PA kennel selection when no sliding window provided ( #15982 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-04-03 05:28:44 +00:00
youkaichao
8b664706aa
[bugfix] add seed in torchrun_example.py ( #15980 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-03 12:25:01 +08:00
yihong
37bfee92bf
fix: better error message for get_config close #13889 ( #15943 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-03 03:53:19 +00:00
Aleksandr Malyshev
e73ff24e31
[ROCM][KERNEL] Paged attention for V1 ( #15720 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com >
2025-04-02 19:48:00 -07:00
Nicolò Lucchesi
bd7599d34a
[V1][TPU] Do not compile sampling more than needed ( #15883 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-03 01:36:01 +00:00
Chengji Yao
01b6113659
[TPU] optimize the all-reduce performance ( #15903 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-03 00:25:14 +00:00
Hyesoo Yang
1b84eff03a
[V1][TPU] TPU-optimized top-p implementation (avoids scattering). ( #15736 )
...
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
Co-authored-by: root <root@t1v-n-822696b7-w-0.us-central2-b.c.tpu-prod-env-large-adhoc.internal>
2025-04-02 17:18:08 -07:00
Harry Mellor
55acf86bf8
Fix huggingface-cli[hf-xet] -> huggingface-cli[hf_xet] ( #15969 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-02 23:37:30 +00:00
Michael Goin
f021b97993
[V1] Support Mistral3 in V1 ( #15950 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-02 15:36:24 -07:00
youkaichao
1cab43c2d2
[misc] instruct pytorch to use nvml-based cuda check ( #15951 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-03 01:02:58 +08:00
Nishidha
8bd651b318
Restricted cmake to be less than version 4 as 4.x breaks the build of… ( #15859 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-04-02 16:19:39 +00:00
Jee Jee Li
58e234a754
[Misc] V1 LoRA support CPU offload ( #15843 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-02 23:04:43 +08:00
rongfu.leng
e86c414d6a
[Model] use AutoWeightsLoader in model load_weights ( #15770 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-02 07:47:31 -07:00
Li, Jiang
550b2801ad
[CPU][Bugfix] Using custom allreduce for CPU backend ( #15934 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-04-02 07:46:47 -07:00
Matthias Matt
cefb9e5a28
[Frontend] Implement Tool Calling with tool_choice='required' ( #13483 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Signed-off-by: Matt, Matthias <matthias.matt@tuwien.ac.at >
Co-authored-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-04-02 07:45:45 -07:00
Mark McLoughlin
98d7367b61
[Metrics] Hide deprecated metrics ( #15458 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-02 07:37:19 -07:00
Chauncey
594a8b9030
[Bugfix] Fix the issue where the model name is empty string, causing no response with the model name. ( #15938 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-02 06:33:52 -07:00
Kay Yan
44f990515b
[CI] Remove duplicate entrypoints-test ( #15940 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-02 02:44:01 -07:00
Brayden Zhong
252937806c
[Bugfix][Benchmarks] Ensure async_request_deepspeed_mii uses the OpenAI choices key ( #15926 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-04-02 02:19:35 -07:00
Harry Mellor
51826d51fa
Add minimum version for huggingface_hub to enable Xet downloads ( #15873 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-02 02:03:36 -07:00
Russell Bryant
14e53ed11f
[V1] Fix json_object support with xgrammar ( #15488 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-02 02:00:08 -07:00
Eric Tang
ddb94c2605
[core] Add tags parameter to wake_up() ( #15500 )
...
Signed-off-by: Eric <erictang000@gmail.com >
2025-04-02 01:59:27 -07:00
LukasBluebaum
90969fb39a
[Kernel] Add more dtype support for GGUF dequantization ( #15879 )
...
Signed-off-by: lukas.bluebaum <lukas.bluebaum@aleph-alpha.com >
2025-04-02 01:58:48 -07:00
Chris Thi
101f1481f9
[Build/CI] Update lm-eval to 0.4.8 ( #15912 )
...
Signed-off-by: Chris Thi <chris.c.thi@gmail.com >
2025-04-02 01:47:57 -07:00
Thien Tran
2edc87b161
[Bugfix] Fix cache block size calculation for CPU MLA ( #15848 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-04-02 01:45:02 -07:00
Jee Jee Li
4203926f10
[CI/Build] Further clean up LoRA tests ( #15920 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-02 01:39:09 -07:00
Chauncey
cdb57015a7
[Misc] Replace print with logger ( #15923 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-02 01:37:38 -07:00
Li Wang
aa557e6422
[Benchmark]Fix error message ( #15866 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-04-02 01:32:24 -07:00
Roger Wang
0e00d40e4f
[V1][Bugfix] Fix typo in MoE TPU checking ( #15927 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-01 23:46:42 -07:00
chun
c920e01242
[Doc] Update rocm.inc.md ( #15917 )
...
Signed-off-by: chun37 <chun.jb.37@gmail.com >
2025-04-01 23:38:26 -07:00
Woosuk Kwon
274d8e8818
[V1][Minor] Enhance SpecDecoding Metrics Log in V1 ( #15902 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-01 23:38:02 -07:00
Thien Tran
2039c6305b
[Bugfix] Fix imports for MoE on CPU ( #15841 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-04-02 03:33:55 +00:00
Brayden Zhong
6efb195a6e
[V1] Fix: make sure k_index is int64 for apply_top_k_only ( #15907 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-04-01 19:06:44 -07:00
Ekagra Ranjan
24b7fb455a
[Spec Decode] Fix input triton kernel for eagle ( #15909 )
2025-04-01 18:15:14 -07:00
Simon Mo
58f5a59769
[Docs] Add Intel as Sponsor ( #15913 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-01 17:16:55 -07:00
Simon Mo
db9dfcfa6a
[Docs] Add Ollama meetup slides ( #15905 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-01 13:58:59 -07:00
Gerald
9ef98d527e
[Model][MiniMaxText01] Support MiniMaxText01 model inference ( #13454 )
...
Signed-off-by: qscqesze <475517977@qq.com >
Co-authored-by: qingjun <qingjun@minimaxi.com >
Co-authored-by: qscqesze <475517977@qq.com >
2025-04-01 16:23:55 -04:00
yihong
93491aefc7
[BugFix] make sure socket close ( #15875 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-01 13:10:24 -07:00
Simon Mo
7acd539cd7
[Docs] update usage stats language ( #15898 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-01 12:54:13 -07:00
Woosuk Kwon
e75a6301bd
[V1][Spec Decode] Implement Eagle Proposer [1/N] ( #15729 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-01 12:33:16 -07:00
Mark McLoughlin
a79cc68b3a
[V1][Metrics] Initial speculative decoding metrics ( #15151 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-01 10:45:04 -07:00
Roger Wang
7e3f7a4ee7
[CI] Disable flaky structure decoding test temporarily. ( #15892 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-01 17:42:34 +00:00
cloud11665
9ec8257914
[Model] Add module name prefixes to gemma3 ( #15889 )
...
Signed-off-by: Bartholomew Sabat <bartek@recursal.ai >
Co-authored-by: Bartholomew Sabat <bartek@recursal.ai >
2025-04-01 10:13:40 -07:00
Jennifer Zhao
38327cf454
[Model] Aya Vision ( #15441 )
...
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-04-01 16:30:43 +00:00
Jee Jee Li
dfa82e2a3d
[CI/Build] Clean up LoRA tests ( #15867 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-01 16:28:50 +00:00
bnellnm
e59ca942f5
Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. ( #13932 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-01 12:07:43 -04:00
Gregory Shtrasberg
a57a3044aa
[ROCm][Build][Bugfix] Bring the base dockerfile in sync with the ROCm fork ( #15820 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-01 08:56:39 -07:00
Isotr0py
4e5a0f6ae2
[Misc] Allow using OpenCV as video IO fallback ( #15055 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 15:55:13 +00:00
Harry Mellor
b63bd14999
Reinstate format.sh and make pre-commit installation simpler ( #15890 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 15:41:30 +00:00
chaow-amd
2041c0e360
[Doc] Quark quantization documentation ( #15861 )
...
Signed-off-by: chaow <chaow@amd.com >
2025-04-01 08:32:45 -07:00
wang.yuqi
085cbc4f9f
[New Model]: jinaai/jina-reranker-v2-base-multilingual ( #15876 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-01 08:32:26 -07:00
Harry Mellor
2b93162fb0
Remove format.sh as it's been unsupported >70 days ( #15884 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 22:27:46 +08:00
Reid
2e45bd29fe
[Misc] remove unused script ( #15746 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-01 13:58:05 +00:00
Michael Goin
51d7c6a2b2
[Model] Support Mistral3 in the HF Transformers format ( #15505 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-01 06:10:05 -07:00
Yang Chen
f3aca1ee30
setup correct nvcc version with CUDA_HOME ( #15725 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-04-01 06:09:40 -07:00
Rui Qiao
8dd41d6bcc
[Misc] Use envs.VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE ( #15831 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-01 06:07:53 -07:00
Isotr0py
0a298ea418
[Bugfix] Fix no video/image profiling edge case for MultiModalDataParser ( #15828 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-01 18:17:11 +08:00
Harry Mellor
d330558bab
[Docs] Fix small error in link text ( #15868 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 10:05:14 +00:00
shangmingc
656fd72976
[Misc] Fix speculative config repr string ( #15860 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-01 02:26:22 -07:00
Varun Sundar Rabindranath
79455cf421
[Misc] Enable V1 LoRA by default ( #15320 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-04-01 16:53:56 +08:00
Wei Zeng
30d6a015e0
[Feature] specify model in config.yaml ( #15798 )
...
Signed-off-by: weizeng <weizeng@roblox.com >
2025-04-01 01:20:06 -07:00
yihong
8af5a5c4e5
fix: can not use uv run collect_env close #13888 ( #15792 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-01 07:45:49 +00:00
Chen Zhang
3a5f0afcd2
[V1] Implement sliding window attention in kv_cache_manager ( #14097 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-01 00:33:17 -07:00
Gregory Shtrasberg
c7e63aa4d8
[ROCm] Use device name in the warning ( #15838 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-01 00:10:48 -07:00
Lionel Villard
4a9ce1784c
[sleep mode] clear pytorch cache after sleep ( #15248 )
...
Signed-off-by: <villard@us.ibm.com >
2025-03-31 22:58:58 -07:00
Alexander Matveev
7e4e709b43
[V1] TPU - Fix fused MOE ( #15834 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-31 22:58:07 -07:00
Alexey Kiryushin
63d8eabed0
[Bugfix]: Fix is_embedding_layer condition in VocabParallelEmbedding ( #15824 )
...
Signed-off-by: alexwl <alexey.a.kiryushin@gmail.com >
2025-03-31 22:57:59 -07:00
Percy
e830b01383
[Bugfix] Fix extra comma ( #15851 )
...
Signed-off-by: haochengxia <xhc_1007@163.com >
2025-03-31 22:57:28 -07:00
Yan Ma
ff6473980d
[Bugfix][Model] fix mllama multi-image ( #14883 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-03-31 22:53:37 -07:00
Kinfey
a164aea35d
[Frontend] Add Phi-4-mini function calling support ( #14886 )
...
Signed-off-by: Kinfey <kinfeylo@microsoft.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-03-31 22:50:05 -07:00
Harry Mellor
a76f547e11
Rename fallback model and refactor supported models section ( #15829 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 22:49:41 -07:00
Ilya Markov
b7b7676d67
[Distributed] Add custom allreduce support for ROCM ( #14125 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-03-31 22:49:12 -07:00
Harry Mellor
e6e3c55ef2
Move dockerfiles into their own directory ( #14549 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 13:47:32 -07:00
Mark McLoughlin
f98a4920f9
[V1][Core] Remove unused speculative config from scheduler ( #15818 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-31 19:15:21 +00:00
Harry Mellor
d4bfc23ef0
Fix Transformers backend compatibility check ( #15290 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 10:27:07 -07:00
Alexander Matveev
9a2160fa55
[V1] TPU CI - Add basic perf regression test ( #15414 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-31 13:25:20 -04:00
yihong
2de4118243
fix: change GB to GiB in logging close #14979 ( #15807 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-31 10:00:50 -07:00
shangmingc
239b7befdd
[V1][Spec Decode] Remove deprecated spec decode config params ( #15466 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-31 09:19:35 -07:00
Cyrus Leung
09e974d483
[Bugfix] Check dimensions of multimodal embeddings in V1 ( #15816 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-31 09:01:35 -07:00
Harry Mellor
e5ef4fa99a
Upgrade transformers to v4.50.3 ( #13905 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 08:59:37 -07:00
Mrm
037bcd942c
[Bugfix] Fix missing return value in load_weights method of adapters.py ( #15542 )
...
Signed-off-by: noc-turne <2270929247@qq.com >
2025-03-31 06:56:42 -07:00
Alex Brooks
c2e7507ad4
[Bugfix] Fix Crashing When Loading Modules With Batchnorm Stats ( #15813 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-03-31 13:23:53 +00:00
Naveassaf
3aa2b6a637
[Model] Update support for NemotronNAS models ( #15008 )
...
Signed-off-by: Nave Assaf <nassaf@nvidia.com >
2025-03-31 20:35:14 +08:00
youkaichao
555aa21905
[V1] Fully Transparent Implementation of CPU Offloading ( #15354 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-31 20:22:34 +08:00
yihong
e7ae3bf3d6
fix: better install requirement for install in setup.py ( #15796 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-31 05:13:32 -07:00
Harry Mellor
b932c048ac
Recommend developing with Python 3.12 in developer guide ( #15811 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-31 11:54:49 +00:00
Charlie Fu
e85829450d
[Feature][ROCm]Enable fusion pass for torch.compile on ROCm ( #15050 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-03-31 04:42:18 -07:00
Jennifer Zhao
effc5d24fa
[Benchmark] Update Vision Arena Dataset and HuggingFaceDataset Setup ( #15748 )
...
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
2025-03-31 15:38:58 +08:00
Chengyang LIU
18ed3132d2
[Misc] update the comments ( #15780 )
...
Signed-off-by: chengyang liu <lcy4869@gmail.com >
Co-authored-by: chengyang liu <lcy4869@gmail.com >
2025-03-30 19:39:56 -07:00
Woosuk Kwon
9b459eca88
[V1][Scheduler] Avoid calling _try_schedule_encoder_inputs for every request ( #15778 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-30 14:10:42 -07:00
yihong
70fedd0f79
fix: Comments to English for better dev experience ( #15768 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-30 10:47:57 -07:00
kYLe
bb103b29bf
[Bugfix] Added embed_is_patch mask for fuyu model ( #15731 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com >
2025-03-30 03:45:08 -07:00
yihong
248e76c4df
fix: lint fix a ruff checkout syntax error ( #15767 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-30 03:36:02 -07:00
Cyrus Leung
803d5c35f3
[V1] Override mm_counts for dummy data creation ( #15703 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-30 03:20:42 -07:00
pansicheng
7fd8c0f85c
fix test_phi3v ( #15321 )
...
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com >
2025-03-30 02:01:34 -07:00
Reid
44c3a5abc3
[doc] update conda to usage link in installation ( #15761 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-30 08:12:13 +00:00
Julien Denize
6909a76201
[Bugfix] Fix Mistral guided generation using xgrammar ( #15704 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-03-29 20:20:19 -07:00
Chauncey
045533716b
[CI] xgrammar structured output supports Enum. ( #15757 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-29 20:20:02 -07:00
Isotr0py
3c0ff914ac
[Bugfix] Fix Mllama interleaved images input support ( #15564 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-03-29 18:11:15 +00:00
Woosuk Kwon
2bc4be4e32
[V1][Minor] Simplify rejection sampler's parse_output ( #15741 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-29 09:25:17 -07:00
Roger Wang
c67abd614f
[V1] Support interleaved modality items ( #15605 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-29 06:30:09 -07:00
shangmingc
6fa7cd3dbc
[Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore ( #12957 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-29 04:01:46 -07:00
wwl2755
94744ba41a
[V1] [Feature] Collective RPC ( #15444 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-29 03:39:14 -07:00
TJian
4965ec42d2
[FEAT] [ROCm] Add AITER int8 scaled gemm kernel ( #15433 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-29 03:33:56 -07:00
Reid
73aa7041bf
[doc] update doc ( #15740 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-29 04:27:22 +00:00
yarongmu-google
7c1f760024
[Kernel][TPU][ragged-paged-attn] vLLM code change for PR#8896 ( #15659 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-28 21:13:15 -07:00
Nicolò Lucchesi
da461f3cbf
[TPU][V1][Bugfix] Fix w8a8 recompiilation with GSM8K ( #15714 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-28 21:13:06 -07:00
Jinzhen Lin
5b800f0932
[Bugfix] set VLLM_WORKER_MULTIPROC_METHOD=spawn for vllm.entrypoionts.openai.api_server ( #15700 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-03-28 21:12:26 -07:00
cyyever
8427f70493
Use numba 0.61 for python 3.10+ to support numpy>=2 ( #15692 )
...
Signed-off-by: cyy <cyyever@outlook.com >
2025-03-29 12:11:51 +08:00
Russell Bryant
7a7992085b
[CI] Speed up V1 structured output tests ( #15718 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-28 21:10:45 -07:00
Varun Sundar Rabindranath
1286211f57
[Bugfix] LoRA V1: add and fix entrypoints tests ( #15715 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-28 21:10:41 -07:00
Nick Hill
6d531ad7b8
[Misc][V1] Misc code streamlining ( #15723 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-28 20:59:47 -07:00
Ce Gao
762b424a52
[Docs] Document v0 engine support in reasoning outputs ( #15739 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-29 03:46:57 +00:00
pengyuange
de1cb38769
[Model] Support Skywork-R1V ( #15397 )
...
Signed-off-by: jiacai.liu <932997367@qq.com >
Co-authored-by: jiacai.liu <932997367@qq.com >
2025-03-28 20:39:21 -07:00
Gregory Shtrasberg
c802f5430d
[ROCm][AMD][Build] Update AMD supported arch list ( #15632 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-28 20:39:18 -07:00
simpx
cff8991a50
[Docs][V1] Optimize diagrams in prefix caching design ( #15716 )
2025-03-29 03:33:58 +00:00
daniel-salib
f3f8d8fff4
implement prometheus fast-api-instrumentor for http service metrics ( #15657 )
2025-03-29 00:12:02 +00:00
Reid
26df46ee59
[Misc] cli auto show default value ( #15582 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-03-28 22:23:00 +00:00
Alexander Matveev
c3f687ac22
[V1] TPU - Fix the chunked prompt bug ( #15713 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-28 20:19:04 +00:00
Luka Govedič
04437e313d
[Bugfix] [torch.compile] Add Dynamo metrics context during compilation ( #15639 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-03-28 14:01:09 -06:00
Robert Shaw
038bededba
[TPU] [Perf] Improve Memory Usage Estimation ( #15671 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-03-28 17:37:52 +00:00
shangmingc
d03308be0c
[Misc] Remove stale func in KVTransferConfig ( #14746 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-28 17:33:32 +00:00
Cyrus Leung
c6bc0034d0
[Misc] Remove unused utils and clean up imports ( #15708 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-28 09:41:16 -07:00
Woosuk Kwon
70e132244a
[Minor] Remove TGI launching script ( #15646 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-28 09:30:08 -07:00
Michael Goin
47e9038d23
Fix cpu offload testing for gptq/awq/ct ( #15648 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-29 00:29:32 +08:00
Kebe
432cf22a6a
[Bugfix] Fix regex compile display format ( #15368 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-03-28 08:58:44 -07:00
Reid
2914006fe0
[doc] add missing imports ( #15699 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-28 15:56:48 +00:00
Russell Bryant
7329ff5468
[V1] Support disable_any_whtespace for guidance backend ( #15584 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-28 23:46:45 +08:00
Cyrus Leung
541d1df486
[Bugfix] embed_is_patch for Idefics3 ( #15696 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-28 08:27:52 -07:00
Chauncey
3b00ff9138
[Bugfix][v1] xgrammar structured output supports Enum. ( #15594 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-28 06:14:53 -07:00
Jee Jee Li
91276c5721
[Model] Adding torch compile annotations to chatglm ( #15624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-28 21:14:09 +08:00
Harry Mellor
0b4167526d
[Docs] Add "Generation quality changed" section to troubleshooting ( #15701 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-28 13:03:21 +00:00
Reid
fd5fd26902
[Frontend] update priority for --api-key and VLLM_API_KEY ( #15588 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-28 19:40:12 +08:00
Ce Gao
3bbaacbe15
[Bugfix][Frontend] Eliminate regex based check in reasoning full generator ( #14821 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-28 11:20:35 +00:00
Lize Cai
a10314c6b3
[Misc] Fix test_sleep to use query parameters ( #14373 )
...
Signed-off-by: Lize Cai <lize.cai@sap.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-28 18:00:14 +08:00
Jee Jee Li
70f2c2a709
[Bugfix] Fix 'InductorAdaptor object has no attribute 'cache_dir' ( #15674 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-28 17:10:40 +08:00
Li, Jiang
280d074103
[CPU][CI] Improve CPU Dockerfile ( #15690 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-03-28 01:36:31 -07:00
Ce Gao
32b14baf8a
[Refactor][Frontend] Keep all logic about reasoning into one class ( #14428 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-28 00:23:30 -07:00
Robert Shaw
2d9045fce8
[TPU][CI] Fix TPUModelRunner Test ( #15667 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-03-28 00:01:26 -07:00
Cyrus Leung
355f66348c
[V1] Remove legacy input registry ( #15673 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 23:34:34 -07:00
Cyrus Leung
8693e47e6a
[Bugfix] Fix mm_hashes forgetting to be passed ( #15668 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-28 05:51:05 +00:00
Jason (Siyu) Zhu
cec8c7d7f8
Refactor error handling for multiple exceptions in preprocessing ( #15650 )
...
Signed-off-by: JasonZhu1313 <jasonchu13@outlook.com >
2025-03-28 03:27:20 +00:00
Gregory Shtrasberg
4d0ec37267
[Quantization][FP8] Adding support for fp8 gemm layer input in fp8 ( #14578 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-03-28 02:58:16 +00:00
Chen Xia
e7f720ea56
[Misc]add coding benchmark for speculative decoding ( #15303 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
2025-03-28 10:47:05 +08:00
Wes
4ae17bf1e2
Revert "Use Cache Hinting for fused_moe kernel ( #15511 )" ( #15645 )
...
Signed-off-by: Wes Medford <wryanmedford@gmail.com >
2025-03-27 19:45:55 -07:00
Robert Shaw
8a49eea74b
[CI][TPU] Temporarily Disable Quant Test on TPU ( #15649 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-27 19:45:05 -07:00
wwl2755
b4245a48df
[Doc] Fix dead links in Job Board ( #15637 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-28 02:43:40 +00:00
Kebe
4e0f6076be
[Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS. ( #14948 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-28 10:13:41 +08:00
Jee Jee Li
726efc6a32
[Quantization][V1] BitsAndBytes support V1 ( #15611 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-28 10:12:47 +08:00
Robert Shaw
bd45912b99
[TPU] Lazy Import ( #15656 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-28 09:57:01 +08:00
Nick Hill
15dac210f0
[V1] AsyncLLM data parallel ( #13923 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-27 16:14:41 -07:00
Russell Bryant
112b3e5b3b
[CI] Update rules for applying tpu label. ( #15634 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-27 22:15:26 +00:00
cnorman
32d669275b
Correct PowerPC to modern IBM Power ( #15635 )
...
Signed-off-by: Christy Norman <christy@linux.vnet.ibm.com >
2025-03-27 15:04:32 -07:00
Nicolò Lucchesi
4098b72210
[Bugfix][TPU][V1] Fix recompilation ( #15553 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-27 19:15:06 +00:00
Harry Mellor
46450b8d33
Use absolute placement for Ask AI button ( #15628 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-27 18:52:18 +00:00
Cyrus Leung
13ac9cab21
[Misc] Avoid direct access of global mm_registry in compute_encoder_budget ( #15621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 17:52:00 +00:00
Yuan Tang
66aa4c0bf4
[Feature] Add middleware to log API Server responses ( #15593 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-03-27 17:49:38 +00:00
Cyrus Leung
247181536f
[Misc] Replace is_encoder_decoder_inputs with split_enc_dec_inputs ( #15620 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 17:36:32 +00:00
Cyrus Leung
07bf813fb5
[Doc] Link to onboarding tasks ( #15629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 16:30:53 +00:00
Hiroaki Sugiyama
8958217ad5
[Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1 ( #15211 )
...
Signed-off-by: h-sugi <h.sugi@ieee.org >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-27 22:29:29 +08:00
Cyrus Leung
ac5bc615b0
[Model] MiniCPM-V/O supports V1 ( #15487 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 06:07:29 -07:00
Reid
8063dfc61a
[Doc] update --system for transformers installation in docker doc ( #15616 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-27 20:38:46 +08:00
Richard Zou
6278bc829e
Fix incorrect filenames in vllm_compile_cache.py ( #15494 )
...
Signed-off-by: <zou3519@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-27 18:33:41 +08:00
wang.yuqi
3f532cb6a6
[Misc] Use model_redirect to redirect the model name to a local folder. ( #14116 )
2025-03-27 02:21:23 -07:00
Cyrus Leung
e6c9053f9e
[Misc] Clean up scatter_patch_features ( #15559 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 07:45:00 +00:00
Robert Shaw
43ed4143c4
[Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM ( #15587 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com >
Co-authored-by: ElizaWszola <ewszola@redhat.com >
2025-03-27 06:47:25 +00:00
Bella kira
f4c98b4d4c
[Misc] Consolidate LRUCache implementations ( #15481 )
...
Signed-off-by: Bella kira <2374035698@qq.com >
2025-03-27 06:43:43 +00:00
Robert Shaw
e1e0fd7543
[TPU] Avoid Triton Import ( #15589 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-27 06:43:02 +00:00
Rui Qiao
df8d3d1287
[Misc] Restrict ray version dependency and update PP feature warning in V1 ( #15556 )
2025-03-27 06:21:07 +00:00
Chengji Yao
619d3de8bd
[TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS ( #15583 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-26 22:46:26 -07:00
Gregory Shtrasberg
ecff8309a3
[ROCm] Env variable to trigger custom PA ( #15557 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-26 22:46:12 -07:00
Jerry Zhang
dcf2a590f5
Allow torchao quantization in SiglipMLP ( #15575 )
2025-03-26 22:45:51 -07:00
Cody Yu
54aa619459
[V1] Refactor num_computed_tokens logic ( #15307 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-27 04:54:36 +00:00
Mengqing Cao
fb22be5817
[moe][quant] add weight name case for offset ( #15515 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-03-27 04:50:29 +00:00
Wei Zeng
7f301dd8ef
[Doc] Update V1 user guide for fp8 kv cache support ( #15585 )
...
Signed-off-by: weizeng <weizeng@roblox.com >
2025-03-26 19:39:03 -07:00
Varun Sundar Rabindranath
8095341a01
[misc] LoRA: Remove unused long context test data ( #15558 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-27 10:04:51 +08:00
Chenyaaang
69db16a46a
add platform check back ( #15578 )
...
Signed-off-by: Chenyaaang <llccyy1212@gmail.com >
2025-03-27 01:50:27 +00:00
Michael Goin
ce78f9af4e
Add automatic tpu label to mergify.yml ( #15560 )
2025-03-26 21:39:58 -04:00
ElizaWszola
9239bf718e
[Kernel] CUTLASS grouped gemm fp8 MoE kernel ( #13972 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com >
2025-03-27 00:54:44 +00:00
Matthew Vine
7a6d45bc8a
Support FIPS enabled machines with MD5 hashing ( #15299 )
...
Signed-off-by: Matthew Vine <32849887+MattTheCuber@users.noreply.github.com >
2025-03-26 20:19:46 -04:00
Chengji Yao
e74ff409e0
[TPU] support disabling xla compilation cache ( #15567 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-27 00:09:28 +00:00
Wes
7a888271f5
Use Cache Hinting for fused_moe kernel ( #15511 )
2025-03-26 23:21:34 +00:00
Alexander Matveev
9d119a86ae
[V1] TPU CI - Fix test_compilation.py ( #15570 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-26 21:51:54 +00:00
Alexander Matveev
b2e85e26f4
[V1] TPU - Revert to exponential padding by default ( #15565 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-26 21:35:05 +00:00
Alexei-V-Ivanov-AMD
dd8a29da99
Applying some fixes for K8s agents in CI ( #15493 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-03-26 20:35:11 +00:00
marko
27df5199d9
Support SHA256 as hash function in prefix caching ( #15297 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-03-26 11:11:28 -07:00
Nick Hill
35fad35a48
[V1][Sampler] Faster top-k only implementation ( #15478 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-26 10:56:47 -07:00
Aaron Pham
733e7c9e95
[Refactor] Remove unnecessary backend parameter in structured output interface ( #15317 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-26 17:51:56 +00:00
Harry Mellor
0af4d764d6
Fix weight loading for some models in Transformers backend ( #15544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-26 10:17:53 -07:00
youkaichao
e64afa455c
multi-node offline DP+EP example ( #15484 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-26 23:54:24 +08:00
Alex Brooks
1711b929b6
[Model] Add Reasoning Parser for Granite Models ( #14202 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Co-authored-by: Joe Runde <joe@joerun.de >
2025-03-26 14:28:07 +00:00
Harry Mellor
c091c0a588
Improve validation of TP in Transformers backend ( #15540 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-26 07:26:48 -07:00
cyyever
1aa162e030
Apply torchfix ( #15532 )
...
Signed-off-by: cyy <cyyever@outlook.com >
2025-03-26 12:09:06 +00:00
Harry Mellor
cf5c8f1686
Separate base model from TransformersModel ( #15467 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-03-26 18:13:38 +08:00
Reid
4ec2cee000
[Misc] improve example script output ( #15528 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-26 10:12:47 +00:00
wwl2755
99f536f830
[Misc] Enhance warning information to user-defined chat template ( #15408 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-26 02:21:15 -07:00
vllmellm
5ebf66748b
[FEAT][ROCm] Integrate Fused MoE Kernels from AITER ( #14967 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-26 16:30:30 +08:00
Bryan Lu
781d056280
[Feature] Enhance EAGLE Architecture with Proper RMS Norms ( #14990 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-26 08:24:07 +00:00
daniel-salib
5aefd6ac31
Fix raw_request extraction in load_aware_call decorator ( #15382 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2025-03-25 22:29:54 -07:00
Varun Sundar Rabindranath
6c663dfd5e
[misc] LoRA - Skip LoRA kernels when not required ( #15152 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-26 11:33:45 +08:00
Lucas Wilkinson
33437bc6e7
[BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results) ( #15492 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-03-25 20:33:22 -07:00
Tyler Michael Smith
23114d3364
[Misc] Warn about v0 in benchmark_paged_attn.py ( #15495 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-25 20:31:04 -07:00
Cyrus Leung
997c8811d6
[Model] Support multi-image for Molmo ( #15438 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-26 11:26:33 +08:00
Harry Mellor
e42389f9d7
Transformers backend already supports V1 ( #15463 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-25 20:26:16 -07:00
Varun Sundar Rabindranath
ff38f0a32c
[CI/Build] LoRA: Delete long context tests ( #15503 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-25 17:18:34 -07:00
Varun Sundar Rabindranath
a5cfbab3c8
[Core] LoRA: V1 Scheduler optimization ( #15422 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-25 22:50:09 +00:00
Chenyaaang
ac3cd6e83c
[core] add bucket padding to tpu_model_runner ( #14995 )
...
Signed-off-by: Chenyaaang <llccyy1212@gmail.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-25 17:27:22 -04:00
Lu Fang
082ab86f5f
[V1] Support long_prefill_token_threshold in v1 scheduler ( #15419 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-25 14:22:26 -07:00
Nick Hill
6aa196c8dc
[V1][Minor] Use SchedulerInterface type for engine scheduler field ( #15499 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-25 14:21:36 -07:00
Nicolò Lucchesi
a0dd7dcd49
[TPU][V1] Fix Sampler recompilation ( #15309 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-25 16:43:54 -04:00
Maximilien de Bayser
e977c11111
Add workaround for shared field_names in pydantic model class ( #13925 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-03-25 20:31:08 +00:00
Joe Runde
5f063a80bd
[bugfix] add supports_v1 platform interface ( #15417 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-03-25 15:00:32 -04:00
Antonio Gómez
5d8e1c9279
[Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility) ( #15471 )
...
Co-authored-by: ServerAI <ai@exc-mad-ai.com >
2025-03-25 17:59:25 +00:00
yarongmu-google
0a049c7d86
[CI/Build] Add tests for the V1 tpu_model_runner. ( #14843 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-25 12:27:16 -04:00
youkaichao
d0cfec7ab9
[bugfix] fix inductor cache on max_position_embeddings ( #15436 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-25 07:05:39 -07:00
Szymon Ożóg
a608160027
[Kernel] Fix conflicting macro names for gguf kernels ( #15456 )
...
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
2025-03-25 13:50:49 +00:00
Cyrus Leung
3f04a7fbf2
[Doc] Update V1 user guide for multi-modality ( #15460 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-25 11:01:58 +00:00
Cyrus Leung
5994430b84
[Misc] Remove redundant num_embeds ( #15443 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-25 18:27:57 +08:00
Cyrus Leung
a9e879b316
[Misc] Clean up MiniCPM-V/O code ( #15337 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-25 10:22:52 +00:00
Md. Shafi Hussain
3e2f37a69a
Dockerfile.ppc64le changes to move to UBI ( #15402 )
...
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-03-25 10:15:14 +00:00
Thien Tran
4f044b1d67
[Kernel][CPU] CPU MLA ( #14744 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-03-25 09:34:59 +00:00
Siyuan Liu
4157f563b4
[Hardware][TPU][Bugfix] Fix v1 mp profiler ( #15409 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-25 01:43:00 -07:00
Lu Fang
051da7efe3
Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 ( #15160 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Richard Barnes <rbarnes@meta.com >
2025-03-25 15:36:45 +08:00
Woosuk Kwon
25f560a62c
[V1][Spec Decode] Update target_logits in place for rejection sampling ( #15427 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-24 21:04:41 -07:00
Russell Bryant
a09ad90a72
[V1] guidance backend for structured output + auto fallback mode ( #14779 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Loc Huynh <jc1da.3011@gmail.com >
Co-authored-by: Michal Moskal <michal@moskal.me >
2025-03-24 21:02:33 -07:00
Chauncey
10b34e36b9
[Bugfix] Fixed the issue of not being able to input video and image simultaneously ( #15387 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-25 03:48:08 +00:00
Tyler Michael Smith
b5269db959
Revert "Fix non-contiguous input passed to Marlin kernel ( #15319 )" ( #15398 )
2025-03-24 20:43:51 -07:00
Jee Jee Li
6db94571d7
[Misc] Remove LoRA log ( #15388 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-24 20:43:48 -07:00
Harry Mellor
97cfa65df7
Add pipeline parallel support to TransformersModel ( #12832 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-03-25 10:41:45 +08:00
Woosuk Kwon
911c8eb000
[Minor][Spec Decode] Remove compiled_softmax ( #15416 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-24 19:09:04 -07:00
Woosuk Kwon
ebcebeeb6b
[V1][Spec Decode] Enable spec decode for top-p & top-k sampling ( #15063 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-24 17:16:46 -07:00
Gregory Shtrasberg
f533b5837f
[ROCm][Kernel] MoE weights padding ( #14454 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: charlifu <charlifu@amd.com >
2025-03-24 23:45:30 +00:00
Gregory Shtrasberg
8279201ce6
[Build] Cython compilation support fix ( #14296 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-24 23:37:54 +00:00
Siyuan Liu
23fdab00a8
[Hardware][TPU] Skip failed compilation test ( #15421 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-24 23:28:57 +00:00
Nick Hill
623e2ed29f
[BugFix][V1] Quick fix for min_tokens with multiple EOS ( #15407 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-24 15:58:59 -07:00
Nick Hill
9d72daf4ce
[V1][Perf] Simpler request output queues ( #15156 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-24 22:44:08 +00:00
Cyrus Leung
6dd55af6c9
[Doc] Update docs on handling OOM ( #15357 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-24 14:29:34 -07:00
Yuan Tang
3eb08ed9b1
[DOC] Add Kubernetes deployment guide with CPUs ( #14865 )
2025-03-24 10:48:43 -07:00
liuzhenwei
5eeadc2642
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral ( #12303 )
...
Signed-off-by: zhenwei <zhenweiliu@habana.ai >
2025-03-24 09:48:40 -07:00
Nick Hill
3aee6573dc
[V1] Aggregate chunked prompt logprobs in model runner ( #14875 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-24 12:27:57 -04:00
Yi Liu
9cc645141d
[MISC] Refine no available block debug msg ( #15076 )
...
Signed-off-by: Yi Liu <yiliu4@habana.ai >
Signed-off-by: yiliu30 <yi4.liu@intel.com >
Co-authored-by: Yi Liu <yiliu4@habana.ai >
2025-03-25 00:01:10 +08:00
Chen1022
0893567db9
[V1][Minor] fix comments ( #15392 )
...
Signed-off-by: chenjincong <chenjincong@baidu.com >
Signed-off-by: Chen-0210 <chenjincong11@gmail.com >
Co-authored-by: chenjincong <chenjincong@baidu.com >
2025-03-24 08:45:32 -07:00
Russell Bryant
8abe69b499
[Core] Don't force uppercase for VLLM_LOGGING_LEVEL ( #15306 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-24 08:27:30 -07:00
Manish Sethi
761702fd19
[Core] Integrate fastsafetensors loader for loading model weights ( #10647 )
...
Signed-off-by: Manish Sethi <Manish.sethi1@ibm.com >
2025-03-24 08:08:02 -07:00
youkaichao
9606d572ed
[distributed] fix dp group ( #15355 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-24 14:54:27 +00:00
Cyrus Leung
cbcdf2c609
[Bugfix] Fix chat template loading ( #15143 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-24 13:50:09 +00:00
Russell Bryant
038de04d7b
Fix zmq IPv6 URL format error ( #15341 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-24 09:30:41 -04:00
Jinzhen Lin
6b3cc75be0
[Kernel] allow non-contiguous input for marlin kernel ( #14658 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-03-24 09:21:33 -04:00
Simon Mo
7ffcccfa5c
Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa ( #13569 )" ( #15377 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-24 05:53:10 -07:00
sfbemerk
cc8accfd53
[Misc] Update guided decoding logs to debug ( #15310 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2025-03-24 04:25:20 -07:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
948ab03e7e
[Bugfix][V1] Avoid importing PreTrainedModel ( #15366 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-03-24 10:33:12 +00:00
Rui Qiao
5797fb97e9
[Misc] Remove ignore_reinit_error for ray.init() ( #15373 )
2025-03-24 07:41:53 +00:00
Jee Jee Li
3892e58ad7
[Misc] Upgrade BNB version ( #15183 )
2025-03-24 05:51:42 +00:00
Qubitium-ModelCloud
d20e261199
Fix non-contiguous input passed to Marlin kernel ( #15319 )
2025-03-24 03:09:44 +00:00
Luka Govedič
f622dbcf39
[Fix] [torch.compile] Improve UUID system for custom passes ( #15249 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-03-24 01:54:07 +00:00
Lucas Wilkinson
dccf535f8e
[V1] Enable V1 Fp8 cache for FA3 in the oracle ( #15191 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-23 15:07:04 -07:00
Roger Wang
9c5c81b0da
[Misc][Doc] Add note regarding loading generation_config by default ( #15281 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-23 14:00:55 -07:00
Robin
d6cd59f122
[Frontend] Support tool calling and reasoning parser ( #14511 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-03-23 14:00:07 -07:00
Woosuk Kwon
bc8ed3c4ba
[V1][Spec Decode] Use better defaults for N-gram ( #15358 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-23 10:52:30 -07:00
Woosuk Kwon
b9bd76ca14
[V1][Spec Decode] Respect prompt_lookup_max ( #15348 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-23 10:41:44 -07:00
DefTruth
6ebaf9ac71
[Bugfix] consider related env vars for torch.compiled cache hash ( #14953 )
...
Signed-off-by: DefTruth <31974251+DefTruth@users.noreply.github.com >
2025-03-23 15:53:09 +00:00
DefTruth
f90d34b498
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 ( #15322 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-03-23 01:10:10 -07:00
youkaichao
f68cce8e64
[ci/build] fix broken tests in LLM.collective_rpc ( #15350 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-23 14:49:48 +08:00
youkaichao
09b6a95551
[ci/build] update torch nightly version for GH200 ( #15135 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-23 14:04:13 +08:00
shangmingc
50c9636d87
[V1][Usage] Refactor speculative decoding configuration and tests ( #14434 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-22 19:28:10 -10:00
hijkzzz
0661cfef7a
Fix v1 supported oracle for worker-cls and worker-extension-cls ( #15324 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-23 10:23:35 +08:00
Chen Zhang
a827aa815d
[doc] Add back previous news ( #15331 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-22 17:38:33 -07:00
Russell Bryant
b877031d80
Remove openvino support in favor of external plugin ( #15339 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-22 14:06:39 -07:00
Wang Ran (汪然)
dd861b992f
[BugFix][Typing] Fix Imprecise Type Annotations ( #15208 )
...
Signed-off-by: Wang Ran (汪然) <wrran@outlook.com >
2025-03-22 09:05:03 -07:00
Russell Bryant
eb63ea1e18
[V1] Add disable-any-whitespace option support for xgrammar ( #15316 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-22 15:56:17 +00:00
Naitong Yu
2f4bd358f1
[Model] Support Tele-FLM Model ( #15023 )
...
Signed-off-by: Naitong Yu <ntyu@baai.ac.cn >
Signed-off-by: jiangxin <horizon94@outlook.com >
Co-authored-by: Jason Fang <jasonfang3900@gmail.com >
Co-authored-by: jiangxin <horizon94@outlook.com >
2025-03-22 02:04:44 -07:00
Varun Sundar Rabindranath
8a8b30eac1
[Bugfix] LoRA V0 - Fix case where max_num_seqs is between cudagraph capture sizes ( #15308 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-22 02:03:32 -07:00
Jee Jee Li
2fa0e1396b
[Bugfix] Fix torch.compile raise FileNotFoundError ( #15278 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-22 13:49:34 +08:00
wwl2755
1c2bec0f82
[Doc] add load_format items in docs ( #14804 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-21 22:36:43 -07:00
TJian
ec870fba9a
[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature ( #14959 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-21 22:36:14 -07:00
Andy Lo
df1430265c
[Bugfix][V0] Multi-sequence logprobs streaming edge case ( #15259 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-03-21 22:35:37 -07:00
Rui Qiao
4c69e228b3
[Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout ( #15301 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-21 22:25:43 -07:00
Russell Bryant
790b79750b
[Build/CI] Fix env var typo ( #15305 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-21 22:28:46 +00:00
Nicolò Lucchesi
cfbb8c930f
[TPU][V1] MHA Pallas backend ( #15288 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-21 08:50:39 -07:00
Cyrus Leung
baec0d4de9
Revert "[Feature] specify model in config.yaml ( #14855 )" ( #15293 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-21 08:30:23 -07:00
Mengqing Cao
c21b99b912
[Bugfix][VLM] fix llava processor ( #15285 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-03-21 05:14:36 -07:00
Chen Zhang
93a00d7dde
[v1] Refactor KVCacheConfig ( #14079 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-21 04:56:27 -07:00
Russell Bryant
61e8c18350
[Misc] Add cProfile helpers ( #15074 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-21 04:56:09 -07:00
Isotr0py
8afcd0f633
[Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend ( #15282 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-21 11:42:06 +00:00
Lehua Ding
91ca929dc7
[V1] Fix wrong import path of get_flash_attn_version ( #15280 )
...
Signed-off-by: Lehua Ding <lehuading@tencent.com >
2025-03-21 03:54:11 -07:00
Isotr0py
84e00adc8a
[Bugfix] Fix incorrect resolving order for transformers fallback ( #15279 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-21 03:54:08 -07:00
Isotr0py
47c7126213
[Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL ( #15273 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-21 10:32:33 +00:00
Shanshan Shen
a989ca2bf6
[Bugfix] Add int8 torch dtype for KVCache ( #15260 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-03-21 08:58:28 +00:00
Wei Zeng
0fa3970deb
[Feature] specify model in config.yaml ( #14855 )
...
Signed-off-by: weizeng <weizeng@roblox.com >
2025-03-21 00:26:03 -07:00
Nick Hill
da6ea29f7a
[V1] Avoid redundant input processing in n>1 case ( #14985 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-20 22:24:10 -07:00
Edwin Hernandez
7297941b38
[Doc] Update LWS docs ( #15163 )
...
Signed-off-by: Edwinhr716 <Edandres249@gmail.com >
2025-03-20 21:18:47 -07:00
Isotr0py
f8a08cb90d
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs ( #14071 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-21 03:14:19 +00:00
Siyuan Liu
b15fd2be2a
[Hardware][TPU] Add check for no additional graph compilation during runtime ( #14710 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-21 03:05:28 +00:00
Woosuk Kwon
e588ac237c
Add an example for reproducibility ( #15262 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-20 19:55:47 -07:00
Cody Yu
5df2da5b97
[Misc] Better RayExecutor and multiprocessing compatibility ( #14705 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-20 19:27:46 -07:00
Woosuk Kwon
11b986b3fb
[Docs] Trim the latest news in README ( #15261 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-20 19:24:21 -07:00
Chih-Chieh Yang
296f927f24
[Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies ( #14857 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-03-20 19:21:08 -07:00
Travis Johnson
0032903a5b
[Bugfix] detect alibi and revert to FA2 ( #15231 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-03-20 19:20:16 -07:00
Hyesoo Yang
47195057e9
[V1][TPU] Speed up top-k on TPU by using torch.topk ( #15242 )
...
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
2025-03-20 19:19:40 -07:00
Harry Mellor
6edbfa924d
Mention extra_body as a way to pass vLLM only parameters using the OpenAI client ( #15240 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-20 19:18:36 -07:00
Isotr0py
1e508343e1
[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation ( #15200 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-20 19:18:04 -07:00
Sage Moore
2e0b4cfde0
[ROCM] Upgrade torch to 2.6 ( #15244 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-20 19:17:33 -07:00
Jee Jee Li
10f55fe6c5
[Misc] Clean up the BitsAndBytes arguments ( #15140 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-20 19:17:12 -07:00
Lu Fang
d3ccbd6350
Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 ( #15159 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Richard Barnes <rbarnes@meta.com >
2025-03-21 10:01:11 +08:00
Varun Sundar Rabindranath
0cfe7d386d
[CI/Build] LoRA : make add_lora_test safer ( #15181 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-21 09:28:53 +08:00
Woosuk Kwon
0c6f5023c3
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface ( #15250 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-20 17:50:43 -07:00
Yu Chin Fabian Lim
06dd08256f
Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. ( #14617 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-03-21 00:44:37 +00:00
Woosuk Kwon
2b22290ce0
[V1] Add flag to disable cascade attention ( #15243 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-20 15:24:16 -07:00
Jason
d8e82bc06d
[Bugfix] fix V1 Engine crash while handling requests with duplicate request id ( #15043 )
...
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com >
2025-03-20 10:01:02 -07:00
Chi Zhang
086b56824c
[ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 ( #15172 )
...
Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-21 00:30:04 +08:00
Harry Mellor
5a0905ba2a
Replace misc issues with link to forum ( #15226 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-20 23:18:20 +08:00
Richard Liu
a8f12a63fd
Fix env vars for running Ray distributed backend on GKE ( #15166 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-20 14:59:33 +00:00
Harry Mellor
69ae2380c6
Add user forum to README ( #15220 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-20 22:39:51 +08:00
Cyrus Leung
27261e40a6
[Bugfix] Multi-video inference on LLaVA-Onevision ( #15082 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-03-20 14:10:45 +00:00
Quang-Linh LE
e3f813c33b
[macOS] Upgrade pytorch to 2.6.0 ( #15129 )
2025-03-20 01:22:40 -07:00
Wang Ran (汪然)
c607a2652b
Fixing Imprecise Type Annotations ( #15192 )
2025-03-20 01:19:55 -07:00
Kevin H. Luu
3d45e3d749
[release] Tag vllm-cpu with latest upon new version released ( #15193 )
2025-03-20 01:19:10 -07:00
billishyahao
742369d35a
[Frontend][Bugfix] support prefill decode disaggregation on deepseek ( #14824 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
Co-authored-by: Zhai Feiyue <80079571+ZhaiFeiyue@users.noreply.github.com >
2025-03-20 00:00:33 -07:00
Wang Ran (汪然)
bfe2fe0af4
typo: Update config.py ( #15189 )
2025-03-19 23:31:21 -07:00
Matt Ritter
a8652f4f0f
Enable CUDA graph support for llama 3.2 vision ( #14917 )
...
Signed-off-by: Matt Ritter <100659061+mritterfigma@users.noreply.github.com >
2025-03-19 23:29:16 -07:00
Cyrus Leung
2f726b241e
[Doc] Update README.md ( #15187 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-20 13:25:58 +08:00
Mickaël Seznec
a597a57595
[Attention] Flash Attention 3 - fp8 ( #14570 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-03-20 01:14:20 -04:00
Chauncey
ae65f3e237
[Misc]fixed disable these http request logs ( #14754 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-19 21:53:40 -07:00
Roger Wang
34868b106a
[Doc] Update Mistral Small 3.1/Pixtral example ( #15184 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-20 04:46:06 +00:00
Russell Bryant
1f16b7fe74
[Core][V0] Add guidance backend for structured output ( #14589 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Loc Huynh <lohuynh@microsoft.com >
Co-authored-by: Michal Moskal <michal@moskal.me >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-19 21:33:51 -07:00
Jennifer Zhao
b88be22165
[Benchmark] Allow oversample request in benchmark dataset ( #15170 )
...
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
2025-03-20 12:32:58 +08:00
Nicolò Lucchesi
d8c6d7d6b5
[V1][TPU] Support V1 Sampler for ragged attention ( #14227 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-19 21:00:39 -07:00
Wang, Yi
40828ce5fe
fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… ( #14673 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
2025-03-19 20:56:16 -07:00
Cyrus Leung
ffa443afed
[Bugfix] Fix embedding assignment for InternVL-based models ( #15086 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-20 03:40:13 +00:00
Jovan Sardinha
70e500cad9
Fix broken tests ( #14713 )
...
Signed-off-by: JovanSardinha <jovan.sardinha@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-03-20 02:06:49 +00:00
Rui Qiao
4cb1c05c9e
[Doc] Clarify run vllm only on one node in distributed inference ( #15148 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-20 09:55:59 +08:00
Nick Hill
c47aafa37c
[BugFix] Lazily import XgrammarBackend to avoid early cuda init ( #15171 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-20 01:30:43 +00:00
Alexander Matveev
cfbca8a2f2
[V1] TPU - Tensor parallel MP support ( #15059 )
2025-03-20 00:55:18 +00:00
Simon Mo
0fe5609874
[Docs] Announce Ollama and Singapore Meetups ( #15161 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-19 16:18:04 -07:00
Nick Hill
22d33baca2
[FrontEnd][Perf] merge_async_iterators fast-path for single-prompt requests ( #15150 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-19 21:04:41 +00:00
iefgnoix
b0e96aaebb
[V1][TPU] Change kv cache shape. ( #15145 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-03-19 12:16:42 -07:00
Wang Ran (汪然)
8310e0b59b
simple bugfix: Update stats.py ( #15139 )
2025-03-19 18:26:27 +00:00
maobaolong
26dd972adb
[FEAT]Support reset prefix cache by specified device ( #15003 )
2025-03-19 10:54:41 -07:00
Murali Andoorveedu
61c7a1b856
[V1] Minor V1 async engine test refactor ( #15075 )
...
Signed-off-by: andoorve <murali.andoorveedu@mail.utoronto.ca >
Co-authored-by: andoorve <murali.andoorveedu@mail.utoronto.ca >
2025-03-19 10:37:17 -07:00
Alessandro Sangiorgi
374ee287d8
[Frontend] Remove custom_cache_manager ( #13791 )
...
Signed-off-by: fulvius31 <asangior@redhat.com >
2025-03-20 00:13:50 +08:00
Kero Liang
a4d83661d7
[Misc] Update the "the first vLLM China Meetup" slides link to point to the first page ( #15134 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-03-19 15:07:39 +00:00
Jan Kaniecki
8363cd093d
[Bugfix] Adjust mllama to regional compilation ( #15112 )
...
Signed-off-by: Jan Kaniecki <jkaniecki@habana.ai >
2025-03-19 07:57:25 -07:00
Aaron Pham
6c5a3195db
[Misc][Benchmark] Add support for different tokenizer_mode ( #15040 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-03-19 14:56:50 +00:00
Marc-Alexandre Côté
073d1ed354
[Doc] Update tip info on using latest transformers when creating a custom Dockerfile ( #15070 )
2025-03-19 13:33:40 +00:00
Cyrus Leung
3d446433ec
[Bugfix] Fix size calculation of processing cache ( #15114 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-19 05:53:19 -07:00
Cyrus Leung
1fe0fd12d3
[Misc] Avoid unnecessary HF do_rescale warning when passing dummy data ( #15107 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-19 03:42:31 -07:00
Roger Wang
dafb4e504a
[V1][Bugfix] Fix oracle for device checking ( #15104 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-19 18:35:32 +08:00
Kunshang Ji
68cf1601d3
[CI][Intel GPU] update XPU dockerfile and CI script ( #15109 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-03-19 01:29:25 -07:00
Cyrus Leung
61f412187d
[Bugfix] Re-enable Gemma3 for V1 ( #14980 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-18 23:58:22 -07:00
Woosuk Kwon
05ccd0aa35
[V1] Ensure using int64 for sampled token ids ( #15065 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-18 23:52:19 -07:00
Cyrus Leung
f690372b68
[Core] Update dtype detection and defaults ( #14858 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-19 13:49:33 +08:00
Brayden Zhong
8b3e94a357
[Model] Remove duplicated message check in Mistral chat completion request ( #15069 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-03-19 05:09:32 +00:00
Julien Denize
437f9162d0
[Model] Pixtral: Remove layer instantiation duplication ( #15053 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-03-19 10:34:03 +08:00
Cody Yu
4f065f12f5
[Misc][V1] Skip device checking if not available ( #15061 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-18 19:33:43 -07:00
Jennifer Zhao
228b768db6
[Doc] Minor v1_user_guide update ( #15064 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-03-18 16:10:45 -07:00
Chujie Zheng
027827cc1d
fix long dtype in topk sampling ( #15049 )
2025-03-18 15:57:31 -07:00
Alexander Matveev
72a8639b68
[V1] TPU - CI/CD use smaller model ( #15054 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-18 21:39:21 +00:00
Woosuk Kwon
99abb8b650
[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels ( #14930 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-18 14:31:54 -07:00
Russell Bryant
3a1e648158
[V1] Refactor Structured Output for multiple backends ( #14694 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-18 19:49:15 +00:00
Jee Jee Li
46c759c165
[Bugfix] Fix LoRA extra vocab size ( #15047 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-18 09:40:29 -07:00
Isotr0py
179a619c21
[Bugfix] Fix broken CPU quantization due to triton import ( #15038 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-18 08:57:39 -07:00
yury-tokpanov
452e8fd968
[MODEL] Add support for Zamba2 models ( #13185 )
...
Signed-off-by: Yury Tokpanov <yury@zyphra.com >
Signed-off-by: Quentin Anthony <qganthony@yahoo.com >
Co-authored-by: Quentin Anthony <qganthony@yahoo.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-18 08:56:21 -07:00
ekuznetsov139
8b793f7ec6
MI325 configs, fused_moe_kernel bugfix ( #14987 )
...
Signed-off-by: Eugene Kuznetsov <eugene.kuznetsov@amd.com >
2025-03-18 08:05:18 -07:00
Nicolò Lucchesi
af35d3a3cc
[TPU][V1][Bugfix] Fix chunked prefill with padding ( #15037 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-18 07:34:45 -07:00
Simon Mo
3b457143d2
[Bugfix] Register serializers for V0 MQ Engine ( #15009 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-18 09:14:47 -04:00
Cyrus Leung
ab656f2c2f
[Bugfix] Loosen type check to avoid errors in V1 ( #15021 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-18 12:54:40 +00:00
Serena
64fc2193dc
[Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros ( #14347 )
2025-03-18 05:50:19 -07:00
Sebastian Schoennenbeck
dd732028f5
[Bugfix][Frontend] Fix validation of logprobs in ChatCompletionRequest ( #14352 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-03-18 05:50:05 -07:00
hoshi-hiyouga
414919138b
[Bugfix] torchrun compatibility ( #14899 )
...
Signed-off-by: hiyouga <hiyouga@buaa.edu.cn >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-18 05:49:27 -07:00
Jee Jee Li
db7c8ca910
[Misc] Embedding model support LoRA ( #14935 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-18 12:07:00 +00:00
Patrick von Platen
f863ffc965
[Mistral-Small 3.1] Update docs and tests ( #14977 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-18 03:29:42 -07:00
Varun Sundar Rabindranath
400d483e87
[Kernels] LoRA - Retire SGMV and BGMV Kernels ( #14685 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-18 09:47:53 +00:00
Shanshan Shen
d1695758b2
[Doc][V1] Fix V1 APC doc ( #14920 )
2025-03-18 08:15:46 +00:00
Liangfu Chen
53a0cf8b95
[Neuron] trim attention kernel tests to fit trn1.2x instance ( #14988 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-03-18 15:05:52 +08:00
Tristan Leclercq
5eeabc2a44
[Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights ( #14950 )
2025-03-17 23:27:26 +00:00
Alexander Matveev
18551e820c
[V1] TPU - Fix CI/CD runner ( #14974 )
2025-03-17 21:07:07 +00:00
Robert Shaw
e41e160263
[V1] Guard Against Main Thread Usage ( #14972 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-17 13:23:02 -07:00
Cyrus Leung
b89fb2a4a1
[CI/Build] Use AutoModelForImageTextToText to load VLMs in tests ( #14945 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 18:35:17 +00:00
Roger Wang
5340b0e221
[Bugfix] Fix interface for Olmo2 on V1 ( #14976 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-17 11:26:38 -07:00
Roger Wang
37e3806132
[Bugfix] Make Gemma3 MM V0 only for now ( #14971 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-17 10:04:21 -07:00
Aaron Pham
c0efdd655b
[Fix][Structured Output] using vocab_size to construct matcher ( #14868 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-03-17 11:42:45 -04:00
Quentin
aaaec52ad9
[Bugfix][Model] Mixtral: use unused head_dim config argument ( #14961 )
...
Signed-off-by: Quentin Torroba <quentin.torroba@mistral.ai >
2025-03-17 07:44:18 -07:00
Tyler Michael Smith
e1eb45d397
[Bugfix] Fix precommit - line too long in pixtral.py ( #14960 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 07:18:50 -07:00
Simon Mo
89fca671fb
[V1] Default MLA to V1 ( #14921 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-17 06:54:40 -07:00
Patrick von Platen
d20b0c139c
Add patch merger ( #14957 )
2025-03-17 06:47:50 -07:00
Cyrus Leung
166a168b0f
[Doc] Fix misleading log during multi-modal profiling ( #14955 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 06:14:32 -07:00
vllmellm
2bb0e1a799
[Bugfix][ROCm] running new process using spawn method for rocm in tests. ( #14810 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-17 11:33:35 +00:00
Cyrus Leung
6eaf1e5c52
[Misc] Add --seed option to offline multi-modal examples ( #14934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 03:00:17 -07:00
Cyrus Leung
868a8c5b2c
[Bugfix] Fix Ultravox on V1 ( #14929 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 17:15:20 +08:00
iefgnoix
b4ad56c1bd
[V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. ( #14846 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-03-17 01:48:28 -07:00
kushanam
69698f257e
fix minor miscalled method ( #14327 )
2025-03-17 01:47:58 -07:00
Lu Fang
cd0cd85102
[MISC] More AMD unused var clean up ( #14926 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-17 16:40:41 +08:00
Russell Bryant
0a74bfce9c
setup.py: drop assumption about local main branch ( #14692 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-17 01:37:42 -07:00
Chen Zhang
dd3b865854
[Doc] Add vLLM Beijing meetup slide ( #14938 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-17 16:29:36 +08:00
Yan Ma
9b87a579aa
[Misc][XPU] Use None as device capacity for XPU ( #14932 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-03-17 01:22:14 -07:00
Cyrus Leung
b539222d4e
[V1] Remove input cache client ( #14864 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-16 23:42:06 -07:00
Lily Liu
8d6cf89526
[V1] [Spec Decode] Support random sampling for spec decode ( #13933 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 22:00:20 -07:00
Simon Mo
583a9778e0
[Benchmark] Do not save detailed info to json by default ( #14879 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-16 21:48:11 -07:00
Sibi
a73e183e36
[Misc] Replace os environ to monkeypatch in test suite ( #14516 )
...
Signed-off-by: sibi <85477603+t-sibiraj@users.noreply.github.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-16 20:35:57 -07:00
Lucas Wilkinson
1e799b7ec1
[BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context ( #14910 )
2025-03-17 03:35:37 +00:00
Woosuk Kwon
7f6c5ee06c
[V1][Minor] Add __repr__ to ConstantList ( #14907 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 20:20:15 -07:00
Woosuk Kwon
faa0275730
[V1] Optimize the overhead of rewinding ( #14905 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 20:19:30 -07:00
Cyrus Leung
8a5a9b70d7
[CI/Build] Update defaults for test reproducibility ( #14893 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 10:38:15 +08:00
Robert Shaw
bb3aeddfaf
[CI] Nightly Tests ( #14898 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-03-17 02:06:43 +00:00
Robert Shaw
aecc780dba
[V1] Enable Entrypoints Tests ( #14903 )
2025-03-16 17:56:16 -07:00
Vadim Gimpelson
90df7f23aa
[Doc] Add guidance for using ccache with pip install -e . in doc ( #14901 )
2025-03-16 23:10:04 +00:00
Rui Qiao
b9b5bdfc7d
[Misc] Catching Ray Compiled Graph PP test failures for V1 ( #14847 )
2025-03-16 15:46:42 -07:00
Woosuk Kwon
31060b2757
[V1][BugFix] Detect interleaved sliding window attention ( #14896 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 14:53:53 -07:00
Nick Hill
fc1f67715d
[BugFix][V1] Fix overhead related to bad_words sampling when not in use ( #14894 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-16 14:53:34 -07:00
Cyrus Leung
f6137adbcb
Revert "[Bugfix] Limit profiling run sequence length by max_model_len ( #14785 )" ( #14892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-16 09:13:46 -07:00
Cyrus Leung
e53b1350f2
[Bugfix] Explicitly disable Phi-4-multimodal in V1 ( #14889 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-16 09:05:40 -07:00
Kyle Sayers
d30aa7e9e6
[Bugfix] Limit profiling run sequence length by max_model_len ( #14785 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-03-16 07:44:19 -07:00
Lily Liu
d1ad2a57af
[V1] [Spec Decode] Fix ngram tests ( #14878 )
2025-03-16 00:29:22 -07:00
Nick Hill
b82662d952
[BugFix] Fix torch distributed stateless PG backend init ( #14870 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-15 20:26:19 -07:00
Simon Mo
71c1e07107
[Kernel] Add more tuned configs ( #14877 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-15 20:25:03 -07:00
Roger Wang
b30c75dda4
[V1] Remove V0 fallback for mistral-tokenizer ( #14873 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-15 20:21:11 -07:00
Isotr0py
def232e122
[VLM] Clean up Phi-4-MM ViT implementation ( #14812 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-03-15 18:53:52 -07:00
Roger Wang
3453b964a3
[Misc][Doc] Minor benchmark README update ( #14874 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-16 09:46:17 +08:00
Rémi Delacourt
61c6a5a796
[VLM] Merged multi-modal processor for Pixtral ( #12211 )
...
Signed-off-by: remi <remi@mistral.ai >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-15 06:28:27 -07:00
Jun Duan
74bc397b0a
[Core] Expose API endpoint /is_sleeping ( #14312 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-03-15 06:28:14 -07:00
Kunshang Ji
f58aea002c
[CI][Intel GPU] refine intel GPU ci docker build ( #14860 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-03-15 11:58:53 +00:00
Cyrus Leung
3556a41434
[VLM] Limit multimodal input cache by memory ( #14805 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-15 02:52:05 -07:00
Bryan Lu
9ed6ee92d6
[Bugfix] EAGLE output norm bug ( #14464 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-03-15 06:50:33 +00:00
Russell Bryant
ee3778d5fc
[Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes ( #14839 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-15 05:38:19 +00:00
Jennifer Zhao
aaacf17324
[Doc] V1 user guide ( #13991 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-14 22:17:59 -07:00
Aaron Pham
4c7629cae9
[V1][Structured Output] calculate vocab_size eagerly ( #14851 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-14 22:09:51 -07:00
Jee Jee Li
e0fdfa1608
[CI/Build] Delete LoRA bias test ( #14849 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-14 22:09:25 -07:00
Lucas Wilkinson
5952d8ab61
[Attention] Get rid of mla cache alignment ( #14842 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-15 05:08:25 +00:00
Li, Jiang
a2ae496589
[CPU] Support FP8 KV cache ( #14741 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-03-14 22:07:36 -07:00
Simon Mo
877e352262
[Docs] Add new East Coast vLLM Meetup slides to README and meetups.md ( #14852 )
2025-03-14 22:06:38 -07:00
Robert Shaw
d4d93db2c5
[V1] V1 Enablement Oracle ( #13726 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-03-14 22:02:20 -07:00
Lu Fang
8c0d15d5c5
[Misc][Easy] Annotate unused vars in the csrc files ( #14798 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-15 12:40:09 +08:00
Isotr0py
97ac781c62
[Misc] Remove misleading message in gemma2 and gemma3 ( #14850 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-14 21:35:12 -07:00
Russell Bryant
776dcec8fe
Disable outlines cache by default ( #14837 )
2025-03-15 03:57:55 +00:00
Tyler Michael Smith
ccf02fcbae
Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… ( #14848 )
2025-03-14 20:45:42 -07:00
DefTruth
acaea3bb07
[Bugfix][V1] Fix flashinfer sampling ( #14815 )
2025-03-14 20:42:38 -07:00
Liangfu Chen
9f37422779
[Neuron][CI] update docker run command ( #14829 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-03-14 18:51:35 -07:00
yarongmu-google
dd344e0342
[Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … ( #14844 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-15 00:41:15 +00:00
Yuan Tang
54a8804455
[Doc] More neutral K8s deployment guide ( #14084 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-03-14 16:12:36 -07:00
Russell Bryant
bbd94a19fc
[Build/CI] Upgrade aiohttp to include CVE fix ( #14840 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 23:11:28 +00:00
Russell Bryant
233ffce1eb
[Build/CI] Move ninja to common deps ( #14835 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 21:25:28 +00:00
Richard Liu
40677783aa
[CI] Add TPU v1 test ( #14834 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-14 17:13:30 -04:00
Michael Goin
14f301b541
Update to torch==2.6.0 ( #12721 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: luka <luka@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-14 16:58:30 -04:00
Russell Bryant
46f98893dd
[V1] Fix model parameterization for structured output tests ( #14833 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 20:55:18 +00:00
Chih-Chieh Yang
fe66b34728
[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies ( #14778 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-03-14 16:36:18 -04:00
Alexei-V-Ivanov-AMD
270a5da495
Re-enable the AMD Entrypoints Test ( #14711 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-03-14 12:18:13 -07:00
Kevin H. Luu
7097b4cc1c
[release] Remove log cleanup commands from TPU job ( #14838 )
2025-03-14 11:59:52 -07:00
Yajie Wang
977a16772c
[Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 ( #14430 )
...
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com >
2025-03-14 09:55:14 -07:00
daniel-salib
73deea2fdb
[Frontend] track server_load ( #13950 )
2025-03-14 09:53:17 -07:00
Mark McLoughlin
9d2b4a70f4
[V1][Metrics] Updated list of deprecated metrics in v0.8 ( #14695 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-15 00:45:25 +08:00
Russell Bryant
0b0d6421b2
[Frontend] Fix log message to use http vs https ( #14774 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 09:21:09 -07:00
Russell Bryant
1140991a7b
[V1] Fix vocab size calculation for structured output ( #14826 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 09:18:38 -07:00
Cyrus Leung
613c5bb945
[Bugfix] Fix Aria test loading ( #14823 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-14 09:11:23 -07:00
Guillaume Calmettes
fd8e055ffb
[BugFix]: properly catch templating error when preprocess input ( #13976 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-03-14 05:58:34 -07:00
Cyrus Leung
ab93f1360f
[VLM] Various cleanup and fixes ( #14806 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-14 05:58:19 -07:00
DefTruth
40253bab44
[Bugfix][W8A8] fixed cutlass block fp8 binding ( #14796 )
2025-03-14 03:32:42 -07:00
Woosuk Kwon
c77620d22d
[V1][Minor] Minor code cleanup for scheduling metrics ( #14800 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-14 08:21:28 +00:00
Jee Jee Li
989ecd2007
[Misc] Gemma3ForConditionalGeneration supports LoRA ( #14797 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-14 01:07:30 -07:00
WeiCheng
54cc46f3eb
[Bugfix] Fix small typo in the example of Streaming delimiter ( #14793 )
2025-03-14 08:05:17 +00:00
Cyrus Leung
601bd3268e
[Misc] Clean up type annotation for SupportsMultiModal ( #14794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-14 00:59:56 -07:00
Li Wang
09269b3127
[BugFix]Fix performance serving benchmark when enable profiling ( #14737 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-03-14 07:02:05 +00:00
Thien Tran
27b50f1fe6
[Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel ( #14667 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-03-13 23:47:49 -07:00
Lucas Wilkinson
9532c49836
[Attention] MLA get rid of materialization ( #14770 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-13 23:39:02 -07:00
Roger Wang
0c2af17c76
[CI] Fix missing example model id in processor test ( #14787 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-14 13:52:15 +08:00
Jennifer Zhao
a6e0d096dd
[Feature] Add visionarena offline support for benchmark_throughput ( #14654 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-03-14 04:07:54 +00:00
Liangfu Chen
d3d4956261
[Neuron] flatten test parameterization for neuron attention kernels ( #14712 )
2025-03-13 20:46:56 -07:00
Nick Hill
4059adc31b
[Misc][Minor] Simplify SamplingParams.__post_init__() ( #14772 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-14 11:44:20 +08:00
Kevin H. Luu
f1f632d9ec
[ci] Reduce number of tests in fastcheck ( #14782 )
2025-03-13 20:43:45 -07:00
Thien Tran
95d680b862
[Bugfix][IPEX] Add VLLM_CPU_MOE_PREPACK to allow disabling MoE prepack when CPU does not support it ( #14681 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-03-13 20:43:18 -07:00
Thomas Parnell
fb4c7f8ef0
[Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA. ( #14431 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com >
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
2025-03-13 20:42:27 -07:00
Varun Sundar Rabindranath
0b1cfa6180
[Kernel] LoRA - Enable CUDAGraphs for V1 ( #14626 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-13 20:42:04 -07:00
Woosuk Kwon
32ef4983cd
[V1] Temporarily disable FlashInfer Rejection Sampler ( #14788 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-13 20:40:35 -07:00
Roger Wang
ad19c8a003
[V1] Move OOM check into sampler run ( #14728 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-03-13 20:40:23 -07:00
Jeff Daily
2a602b055a
forward fix PR 14245, restore build on ROCm 6.2 ( #14709 )
...
Signed-off-by: Jeff Daily <jeff.daily@amd.com >
2025-03-13 20:40:15 -07:00
Alexander Matveev
7888e1d0a3
[V1] TPU - Enable prefix caching by default ( #14773 )
2025-03-13 20:40:05 -07:00
Chen Zhang
60c872d4b6
[Doc] Fix small typo in Transformers fallback ( #14791 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-13 20:33:12 -07:00
yasu52
3fb17d26c8
[Doc] Fix typo in documentation ( #14783 )
...
Signed-off-by: yasu52 <tsuguro4649@gmail.com >
2025-03-13 20:33:09 -07:00
Lucas Wilkinson
d47807ba08
[Attention] Remove slow setattr in MLA ( #14769 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-13 21:31:14 +00:00
afeldman-nm
02fcaa3d0a
[V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output ( #14624 )
...
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
2025-03-13 19:07:34 +00:00
Aaron Pham
8a4a2efc6f
[V1][Core] using cached vocab_size for Structured Outputs ( #14630 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-13 11:39:28 -07:00
Cyrus Leung
8e9ffd37d6
[Misc] Clean up processor tests ( #14771 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-13 18:25:37 +00:00
Woosuk Kwon
01b3fd0af7
[V1][Minor] Minor enhancements on scheduler ( #14732 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-13 08:53:22 -07:00
Cyrus Leung
f53a0586b9
[Bugfix] Fix prompt format of GLM4V ( #14539 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-13 11:37:17 +00:00
Isotr0py
b1cc4dfef5
[VLM] Support loading InternVideo2.5 models as original InternVLChatModel ( #14738 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-13 03:10:02 -07:00
Cyrus Leung
382403921f
[VLM] Support pan-and-scan for Gemma3 multi-modal processor ( #14672 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-13 02:23:12 -07:00
Jee Jee Li
a73122de96
[Bugfix] fix benchmark moe ( #14653 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-13 16:12:42 +08:00
Jee Jee Li
bd44b812cb
[CI/Build] Delete ultravox LoRA test ( #14730 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-13 07:57:39 +00:00
Szymon Ożóg
55211b01e8
[Bugfix] Fix chunked prefill for GGUF ( #14666 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
2025-03-13 07:19:03 +00:00
Kyle Sayers
5d043c1685
[Quant] Bamba SupportsQuant ( #14698 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-03-13 04:57:05 +00:00
Kyle Sayers
36d1ccb286
[Quant] BartModel SupportsQuant ( #14699 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-03-13 04:55:59 +00:00
Siyuan Liu
1bc3b739c4
[V1][TPU] Add assertion on multi-step-scheduler ( #14707 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-12 21:37:58 -07:00
Mathis Felardos
1bd32bc8dd
[Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config ( #14367 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2025-03-12 20:15:20 -07:00
TY-AMD
128bf75283
[BugFix][TritonMLA] Process weights after model loading for GGUF ( #14555 )
...
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
2025-03-12 20:14:36 -07:00
Gregory Shtrasberg
a94a699c3f
[ROCm][FP8] Fix for adjustments needed only for fnuz ( #14689 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-12 20:14:04 -07:00
Richard Liu
ab426ec9c0
Add ray[data] as tpu dependency ( #14691 )
...
Signed-off-by: <ricliu@google.com >
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-12 20:13:48 -07:00
Joe Runde
165290d357
[bugfix] fixup warning message for plugged schedulers for v1 ( #14700 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-03-12 20:12:13 -07:00
Kevin H. Luu
ce20124671
[release] Add force remove for TPU logs ( #14697 )
2025-03-12 22:35:18 +00:00
Woosuk Kwon
53be4a8634
[V1] Allow sliding window + prefix caching ( #13069 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-12 11:21:19 -07:00
Nick Hill
f5d3acd474
[BugFix][V1] Fix parallel sampling finishing/aborts ( #14512 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-12 10:29:48 -07:00
TJian
916836bbfb
[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. ( #14664 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-12 09:31:19 -07:00
Sage Moore
d9f83d6206
[ROCm] Enable chunked prefill/paged attention in MLA on ROCm ( #14316 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-12 15:51:20 +00:00
ameyanjarlekar
4a754fcf15
[Bugfix] Missing thumbnail from NVLM-D processor ( #14633 )
...
Signed-off-by: ameyanjarlekar <aanjarlekar@nvidia.com >
2025-03-12 08:50:49 -07:00
Woosuk Kwon
c0c25e25fa
[Model] Add support for Gemma 3 ( #14660 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-12 08:36:33 -07:00
Sage Moore
45f3f3f59e
[ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. ( #14629 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-12 08:00:28 -04:00
Li, Jiang
ff47aab056
[CPU] Upgrade CPU backend to torch-2.6 ( #13381 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-03-12 10:41:13 +00:00
Pavani Majety
debd6bbf09
[Kernel] Add ModelOpt FP4 Checkpoint Support ( #12520 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-03-12 05:13:11 +00:00
Benjamin Chislett
5c538c37b2
[V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing ( #14645 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-03-11 22:12:41 -07:00
Szymon Ożóg
e22ee1e7a2
[Kernel] GGUF MoE kernel ( #14613 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
2025-03-12 03:33:27 +00:00
Isotr0py
e392d85831
[Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization ( #14545 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-11 20:12:52 -07:00
Aaron Pham
77a318bd01
[V1][Core] Support MistralTokenizer for Structured Output ( #14625 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-12 10:40:09 +08:00
Farzad Abdolhosseini
80e78d02ac
[Model] Extend Ultravox to accept audio longer than 30s ( #13631 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
2025-03-12 10:27:10 +08:00
Jennifer Zhao
4a42b9f5d6
[Doc] Update benchmarks README ( #14646 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-03-11 19:23:04 -07:00
Joe Runde
47532cd9f4
[core][V1] pluggable scheduler ( #14466 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-03-12 01:15:15 +00:00
Randy Chen
36e0c8f7da
[Feature] Add vllm bench CLI ( #13993 )
...
Signed-off-by: Randy Chen <acad.randyjhc@gmail.com >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-12 00:31:48 +00:00
Kevin H. Luu
9f583e360c
[release] Add commands to clean up logs on TPU release node ( #14642 )
2025-03-12 00:14:50 +00:00
Cody Yu
b706d898af
[Bugfix][V1][PP] Only warmup sampler at last PP rank ( #14643 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-11 23:40:07 +00:00
iefgnoix
863d315c86
[V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly ( #14597 )
2025-03-11 19:12:26 -04:00
Richard Liu
d374f04a33
Fix run_tpu_test ( #14641 )
...
Signed-off-by: <ricliu@google.com >
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-11 21:14:33 +00:00
Russell Bryant
61a01b27a7
[V1] Delay all xgrammar usage until needed ( #14616 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 20:21:33 +00:00
Yang.Tao
53056731fd
fix some typos : supported_head_sizes ( #14627 )
2025-03-11 10:38:24 -07:00
Russell Bryant
4cbf286794
[V1] Remove cache from StructuredOutputManager ( #14622 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 10:36:07 -07:00
Kunshang Ji
c6e14a61ab
[Hardware][Intel GPU] upgrade IPEX dependency to 2.6.10. ( #14564 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-03-11 17:11:47 +00:00
Lucas Wilkinson
07b4b7a37f
[BugFix/Build] Fix sparse kernels not getting built on hopper ( #14572 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-11 17:09:03 +00:00
Dilip Gowda Bhagavan
07964e2f30
docs: Add documentation for s390x cpu implementation ( #14198 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-11 17:02:17 +00:00
Russell Bryant
4bf82d4b90
[V1] Add regex structured output support with xgrammar ( #14590 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 23:03:44 +08:00
Richard Liu
9ab326713f
Uninstall dependencies before installing requirements/tpu.txt ( #14586 )
...
Signed-off-by: <ricliu@google.com >
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-11 08:01:35 -07:00
Cyrus Leung
af295e9b01
[Bugfix] Update --hf-overrides for Alibaba-NLP/gte-Qwen2 ( #14609 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-11 07:59:43 -07:00
Jeff Daily
a1c8f3796c
dynamic dispatch of fp8 kernels ( #14245 )
...
Signed-off-by: Jeff Daily <jeff.daily@amd.com >
2025-03-11 10:54:56 -04:00
Russell Bryant
08a1a1121d
benchmarks: simplify test jsonschema ( #14567 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 13:39:30 +00:00
Isotr0py
1477ffc381
[VLM] Cleanup siglip legacy code and fix broken paligemma multimodal processor ( #14602 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-11 11:27:36 +00:00
yexin(叶鑫)
70b808fe1a
[Perf]:Optimize qwen2-vl to reduce cudaMemcpyAsync ( #14377 )
...
Signed-off-by: cynthieye <987073381@qq.com >
2025-03-11 07:39:56 +00:00
Isotr0py
63d635d179
[Misc] Correct deepseek-vl2 chat template ( #14558 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-11 04:37:11 +00:00
Roger Wang
1fc973c0b5
[V1][Core] Fix memory issue with logits & sampling ( #14508 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com >
2025-03-11 04:03:41 +00:00
Concurrensee
c982ac5722
[Bugfix] Fix FP16 overflow for DeepSeek V2 ( #13232 )
...
Signed-off-by: Yida Wu <yida.wu@amd.com >
2025-03-10 20:46:59 -07:00
Cody Yu
4290b704ff
[V1][PP] Do not block engine core when no requests to schedule ( #14585 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-10 19:48:24 -07:00
Liangfu Chen
c91b64f749
[neuron] add reshape_and_cache ( #14391 )
2025-03-10 18:37:29 -07:00
gnovack
d6123170d5
[Neuron] Add Neuron device communicator for vLLM v1 ( #14085 )
2025-03-10 18:37:04 -07:00
Cody Yu
485afdd3cb
[MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils ( #14379 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-10 20:42:11 -04:00
Jinzhen Lin
90e88ab756
[Kernel] moe wna16 cuda kernel ( #13321 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-03-10 20:12:40 -04:00
Russell Bryant
04421dff8a
[V1] Prevent xgrammar from breaking TPU support ( #14575 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-10 23:06:19 +00:00
Russell Bryant
432d6dad15
Fix typo in benchmark_serving_structured_output.py ( #14566 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-10 14:58:58 -07:00
Varun Sundar Rabindranath
5ff0d32580
[V1] LoRA - Add triton kernels for V1 ( #13096 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-10 17:27:53 -04:00
Woosuk Kwon
0967110e42
[Minor] Update the tqdm bar for parallel sampling ( #14571 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-10 14:23:48 -07:00
Simon Mo
fb0acb6c72
[Perf] Improve MLA on V1 ( #14540 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-10 12:06:58 -07:00
Chauncey
92b0ce2ac7
[Bugfix][v1] fixed llava-hf/llava-1.5-7b-hf is broken on V1 ( #14554 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-10 18:24:51 +00:00
Harry Mellor
bc2d4473bf
[Docs] Make installation URLs nicer ( #14556 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 10:43:08 -07:00
Harry Mellor
3b352a2f92
Correct capitalisation: VLLM -> vLLM ( #14562 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 16:36:21 +00:00
Roger Wang
dea985aef0
[V1][Bugfix] Fix handling of second_per_grid_ts for Qwen2-VL & Qwen2.5-VL ( #14548 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-10 16:03:11 +00:00
Harry Mellor
39be30351f
Correct capitalisation: Github -> GitHub ( #14561 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 15:53:33 +00:00
Cyrus Leung
001a9c7b0d
[Doc] Update PaliGemma note to a warning ( #14565 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-10 15:02:28 +00:00
Szymon Ożóg
89cdaa83e7
[Kernel] Add more dtype support for GGUF kernels ( #14043 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
2025-03-10 07:30:04 -07:00
Chauncey
b0746fae3d
[Frontend] support image embeds ( #13955 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-10 12:36:03 +00:00
Harry Mellor
60a98b2de5
[Docs] Mention model_impl arg when explaining Transformers fallback ( #14552 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 12:13:10 +00:00
Chauncey
460f553a6d
[Misc] Add log information for handle_process_request. ( #14130 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-10 08:40:50 +00:00
Jennifer Zhao
1253b15774
[Feature] Consolidate performance benchmark datasets ( #14036 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-10 07:23:11 +00:00
Martin Hoyer
dc74613fa2
[Bugfix] Wrong requirements path - rocm ( #14527 )
...
Signed-off-by: Martin Hoyer <mhoyer@redhat.com >
2025-03-10 02:49:46 +00:00
Yanyi Liu
a21076ed3a
[Misc] Ensure out-of-tree quantization method recognize by cli args ( #14328 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-03-09 12:13:31 +00:00
Chengji Yao
212007b168
[Hardware][TPU] Fix the recompiling issue in logits processor after warmup ( #14510 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-09 05:44:39 -04:00
Isotr0py
fb16eea48b
[Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work ( #14498 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-09 04:47:45 +00:00
Yuchen Yan
73ae0b44e9
[Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 ( #12428 )
...
Signed-off-by: Yuchen Yan <740987012@qq.com >
2025-03-08 20:14:53 -08:00
Jiayi Yao
6d7f037748
[Feat] Support chunked prefill for LMCache connector ( #14505 )
...
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2025-03-08 19:30:06 -08:00
iefgnoix
10f7552789
[V1][TPU] Remove unnecessary padding for running on TPU. ( #14467 )
2025-03-08 21:56:04 -05:00
Lucas Wilkinson
b0d541947a
[Attention] Default to FlashMLA backend for MLA ( #14451 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-08 18:18:39 -08:00
Robert Shaw
5f0b53c6ea
Revert "[V1][Core] Fix memory issue with logits & sampling" ( #14504 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-08 17:43:37 -08:00
22quinn
eb8b5eb183
[V1] Support bad_words in sampler ( #13376 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-08 14:50:26 -08:00
Cyrus Leung
9513290032
[Misc] Upgrade to Python 3.9 typing for additional directories ( #14492 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-08 17:35:50 +00:00
Russell Bryant
0d5e73d30e
Update CODEOWNERS for structured output ( #14496 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-08 17:19:51 +00:00
Isotr0py
609ef61fea
[Bugfix] Fix profiling OOM and decouple encoder multimodal profiling ( #14361 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-08 16:52:34 +00:00
Lucas Wilkinson
db84f5eb3b
[Bugfix] DeepSeek Accuracy ( #14476 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-08 16:47:03 +00:00
Harry Mellor
206e2577fa
Move requirements into their own directory ( #12547 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 16:44:35 +00:00
Cyrus Leung
e02883c400
[Misc] Don't run ruff at all on 3rd party libs ( #14493 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-08 07:16:40 -08:00
Russell Bryant
9085aabd62
[benchmarks] Add option to use unique jsonschema for each request ( #14457 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-08 06:36:39 -08:00
Roger Wang
8d5aa466fb
[V1][Core] Fix memory issue with logits & sampling ( #13776 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-08 06:11:04 -08:00
Aaron Pham
0b7f06b447
[Misc] add use_tqdm_on_load to reduce logs ( #14407 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-08 05:57:46 -08:00
Isotr0py
03fe18ae0f
[VLM] Add TP support for Phi-4-MM ( #14453 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-08 05:57:14 -08:00
Alexander Matveev
cb8bdfade2
[V1] TPU - Add tensor parallel support via Ray ( #13618 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-08 08:19:38 -05:00
Cyrus Leung
33f227e16b
[CI/Build] Use a fixed seed to avoid flaky tests ( #14480 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-08 11:30:09 +00:00
Harry Mellor
cfd0ae8234
Add RLHF document ( #14482 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 09:51:39 +00:00
Lucas Wilkinson
7caff01a7b
[Build/BugFix] Fix hopper 12.8 build ( #14354 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-08 08:11:56 +00:00
Harry Mellor
be0b399d74
Add training doc signposting to TRL ( #14439 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 07:35:07 +00:00
Jee Jee Li
b8b0ccbd2d
[Bugfix] Make the deviceprofiler include LoRA memory. ( #14469 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-08 07:12:22 +00:00
Robin
c908a07f57
[Doc] Added QwQ-32B to the supported models list in the reasoning out… ( #14479 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-03-08 07:07:32 +00:00
Robin
7b6fd6e486
[Doc]add doc for Qwen models tool calling ( #14478 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-03-08 06:58:46 +00:00
Harry Mellor
47512b3200
Default to generation_config from model ( #12622 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 14:46:15 +08:00
Roger Meier
3b9c6c6947
[CI/Build] refactor: set timezone of container to UTC ( #12888 )
...
Signed-off-by: Roger Meier <r.meier@siemens.com >
2025-03-07 22:42:01 -08:00
Aviv Keshet
4aae667668
[core] add extra_args to SamplingParams ( #13300 )
...
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com >
2025-03-08 14:41:18 +08:00
Cody Yu
9f3bc0f58c
[MISC][V1] Register process killing handler only in the main thread ( #14380 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-07 22:40:06 -08:00
Mathis Felardos
980385f8c1
[Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache ( #14369 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2025-03-07 22:39:31 -08:00
Tyler Michael Smith
ca7a2d5f28
Revert "[Perf] Reduce MLA CPU overheads in V1 ( #14384 )" ( #14471 )
2025-03-07 22:18:53 -08:00
Tyler Michael Smith
333681408f
[Bugfix][V1] Handle MLA in kv_cache_interface ( #14462 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-07 22:18:25 -08:00
afeldman-nm
ef64044079
[V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC ( #13949 )
2025-03-08 01:48:12 +00:00
yarongmu-google
66e16a038e
[Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 ( #14459 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-07 23:17:04 +00:00
Mark McLoughlin
e1f0835ae0
[V1][Metrics] Fix traceback with preemptions+LoRA ( #14220 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-07 15:36:16 -05:00
Nick Hill
8ed5421aaa
[V1] Eagerly remove finished requests from the batch ( #14388 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-07 10:56:00 -08:00
youkaichao
c6359e8ca6
[v1] torch.compile integration explanation ( #14437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-08 01:55:50 +08:00
Jee Jee Li
952a074980
[Misc] Add Phi4-MM example ( #14343 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-07 17:28:52 +00:00
Jinzhen Lin
d0feea31c7
[Kernel] optimize performance of gptq marlin kernel when n is small ( #14138 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-03-07 11:53:38 -05:00
Jeremy Arnold
58abe35455
[Benchmarks] Make detokenization optional in benchmark scripts ( #11697 )
...
Signed-off-by: Jeremy Arnold <Jeremy.Arnold@amd.com >
2025-03-07 08:09:00 -08:00
York-RDWang
f7ebad2307
[Doc] Update prefix_caching.md to match the example image ( #14420 )
2025-03-07 15:29:00 +00:00
Aaron Pham
80e9afb5bc
[V1][Core] Support for Structured Outputs ( #12388 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-07 07:19:11 -08:00
iefgnoix
1e3598edeb
Use the optimized block sizes after tuning the kernel. ( #14329 )
2025-03-07 13:25:13 +00:00
Harry Mellor
f7a6bd0fa1
Fix missing kv_caches and attn_metadata in OpenVINOCausalLM ( #14271 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-07 12:30:42 +00:00
Aleksandr Malyshev
0ca3b8e01c
[BUGFIX] Skip tokenization support for throughput benchmark ( #12712 )
...
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-03-07 02:51:47 -08:00
மனோஜ்குமார் பழனிச்சாமி
cc10281498
[Misc] Set default value of seed to None ( #14274 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-03-07 10:40:01 +00:00
Cyrus Leung
05fb6718f0
[Bugfix] Clean up multi-modal processors ( #14417 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-07 10:33:38 +00:00
Jee Jee Li
12c29a881f
[Bugfix] Further clean up LoRA test ( #14422 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-07 10:30:55 +00:00
Peng Li
70da0c0748
correct wrong markdown syntax ( #14414 )
...
Signed-off-by: vincent-pli <justdoit.pli@gmail.com >
2025-03-07 08:01:18 +00:00
Cyrus Leung
c1588a2c94
[GH] Auto-apply multi-modality label to relevant PRs ( #14402 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-07 15:26:32 +08:00
Ilya Lavrenov
8ca7a71df7
OpenVINO: added CPU-like conditions ( #14338 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2025-03-06 22:24:49 -08:00
Isotr0py
63137cd922
[Build] Add nightly wheel fallback when latest commit wheel unavailable ( #14358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-06 22:10:57 -08:00
Jee Jee Li
ddd1ef66ec
[Bugfix] Fix JambaForCausalLM LoRA ( #14370 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-06 22:05:47 -08:00
Lucas Wilkinson
e5e03c2c1b
[BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs ( #14396 )
2025-03-06 21:56:06 -08:00
Luka Govedič
e1744502c2
[FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object ( #14390 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-03-07 05:20:16 +00:00
Lucas Wilkinson
dae6896977
[Perf] Reduce MLA CPU overheads in V1 ( #14384 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-06 19:59:14 -08:00
Brayden Zhong
c34eeec58d
[Bugfix] Correctly call cudaProfilerStop in benchmarks script ( #14183 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-03-07 00:42:49 +00:00
Daniel Li
ad60bbb2b2
[Doc] Fix a typo ( #14385 )
2025-03-06 16:31:52 -08:00
Chengji Yao
0578e5a462
[Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue ( #14310 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-06 23:31:05 +00:00
Michael Goin
04222984f8
[Docs] Add nsight guide to profiling docs ( #14298 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 14:19:58 -08:00
Michael Goin
6832707e90
[V1][Bugfix] Standardize quantized kv cache rejection for attention backends ( #14221 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 14:18:29 -08:00
Michael Goin
6b2ef5cd17
[Bug] Fix Attention when ignored in by quant_method ( #14313 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 14:18:06 -08:00
Tyler Michael Smith
958adce478
[Bugfix] Fix use_direct_call condition in FusedMoE layer for ( #14382 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-06 14:17:21 -08:00
Tyler Michael Smith
99b0915d3b
[Kernel] Add needs_fixed_stride_order tag to most GEMMs ( #14306 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-06 14:17:09 -08:00
Thomas Parnell
8ca2b21c98
[CI] Disable spawn when running V1 Test ( #14345 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-03-06 21:52:46 +00:00
Michael Goin
d9292786e1
[CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa ( #13569 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 16:08:36 -05:00
Tyler Michael Smith
cc2f9b32c8
[Distributed] Add enable_expert_parallel arg ( #14305 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-06 18:54:45 +00:00
Himanshu Jaju
cd579352bf
[V1] Do not detokenize if sampling param detokenize is False ( #14224 )
...
Signed-off-by: Himanshu Jaju <hj@mistral.ai >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-06 10:40:24 -08:00
Ying Zhong
9f1710f1ac
Fix mla prefill context performance ( #13897 )
...
Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com >
2025-03-06 09:35:49 -08:00
Thomas Parnell
e642ec962c
Add authors to license header. ( #14371 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-03-06 08:43:09 -08:00
Dilip Gowda Bhagavan
ada19210a3
Adding cpu inference with VXE ISA for s390x architecture ( #12613 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com >
Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com >
2025-03-06 08:40:53 -08:00
Harry Mellor
bf0560bda9
Reinstate best_of for V0 ( #14356 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-06 08:34:22 -08:00
youkaichao
151b08e0fe
[RLHF] use worker_extension_cls for compatibility with V0 and V1 ( #14185 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-07 00:32:46 +08:00
Jitse Klomp
81b2f4a45f
[Doc] Fix date typo in README.md ( #14366 )
...
Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl >
2025-03-06 08:29:57 -08:00
Cyrus Leung
82551ad616
[Core] Don't use cache during multi-modal profiling ( #14336 )
2025-03-06 08:03:31 -08:00
courage17340
caac5c2e59
[Bugfix][Core] fix abort_seq_group and memory leak when n>1 ( #14326 )
...
Signed-off-by: courage17340 <courage17340@163.com >
2025-03-06 23:59:32 +08:00
Thomas Parnell
6bd1dd9d26
[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend ( #14152 )
2025-03-06 07:39:16 -08:00
Irina Yuryeva
4f27044aab
[Doc] Correct beam_search using in generative_models.md ( #14363 )
2025-03-06 15:37:10 +00:00
Yanyi Liu
0ddc991f5c
[Doc] Update reasoning with stream example to use OpenAI library ( #14077 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-03-06 13:20:37 +00:00
Nicolò Lucchesi
fa82b93853
[Frontend][Docs] Transcription API streaming ( #13301 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-06 10:39:35 +00:00
Nicolò Lucchesi
69ff99fdcd
[Core] Optimizing cross-attention QKVParallelLinear computation ( #12325 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
2025-03-06 09:37:26 +00:00
lkchen
5d802522a7
[V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 ( #14275 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-03-06 08:58:41 +00:00
kYLe
1769928079
[Model] Update Paligemma multimodal processing with PromptUpdate ( #14015 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-03-06 08:31:38 +00:00
Pavani Majety
ed6ea06577
[Hardware] Update the flash attn tag to support Blackwell ( #14244 )
2025-03-05 22:01:37 -08:00
Nicolò Lucchesi
5ee10e990d
[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention ( #11301 )
2025-03-05 20:00:53 -08:00
Varun Sundar Rabindranath
3dbd2d813a
[V1] LoRA - Enable more V1 tests ( #14315 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-06 11:55:42 +08:00
Ce Gao
f5f7f00cd9
[Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 ( #14114 )
2025-03-06 03:49:20 +00:00
Rui Qiao
abcc61e0af
[misc] Mention ray list nodes command to troubleshoot ray issues ( #14318 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-06 02:00:36 +00:00
Lucas Wilkinson
f6bb18fd9a
[BugFix] MLA + V1, illegal memory access and accuracy issues ( #14253 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-05 17:10:13 -08:00
Yuan Tang
71eaf8969b
[Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation ( #13850 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-03-05 17:09:29 -08:00
Michael Goin
ca100c90fe
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM ( #13917 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-05 17:08:51 -08:00
Russell Bryant
ffad94397d
[CI/Build] Use spawn multiprocessing mode for V1 test pipeline ( #14243 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-05 17:08:02 -08:00
Lucas Wilkinson
4dacaa4a83
[BugFix] Fix prefix caching V0 MLA ( #14255 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com >
2025-03-05 17:07:42 -08:00
Tyler Michael Smith
a7ea35aa67
[Bugfix] Remove num_tokens_across_dp ( #14302 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-05 23:55:55 +00:00
pyc96
1e3e76b6cc
[Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch ( #14237 )
...
Signed-off-by: pyc96 <pychen96@gmail.com >
2025-03-05 22:22:40 +00:00
Lu Fang
53ea6ad830
[V1][Easy] Add empty allowed_token_ids in the v1 sampler test ( #14308 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-05 21:41:18 +00:00
Serena
1b7624bf5c
[misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env ( #14267 )
2025-03-05 21:28:50 +00:00
Nick Hill
ac60dc7fe1
[V1][BugFix] Fix for mixed top_k batch ( #14301 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com >
2025-03-05 20:43:04 +00:00
Vincent
a4f1ee35d6
Deprecate best_of Sampling Parameter in anticipation for vLLM V1 ( #13997 )
...
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com >
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-05 20:22:43 +00:00
Nick Hill
a32c8669ca
[V1][Minor] Remove obsolete FIXME comment ( #14304 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-05 11:59:23 -08:00
Simon Mo
ca2ca8de57
[Docs] Add Meta Slides ( #14297 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-05 08:30:23 -08:00
Isotr0py
f71b00a19e
[Bugfix] Fix broken vision language example ( #14292 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-05 15:57:10 +00:00
DaividFrank
8f808cf86e
prefix_caching.md: Fixed typo ( #14293 )
...
Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai >
2025-03-05 15:43:13 +00:00
Jee Jee Li
7bab4bb048
[Misc] Add Qwen2MoeForCausalLM moe tuning support ( #14276 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-05 23:11:29 +08:00
Isotr0py
e17e4488bd
[LoRA] Remove linear hack outside transformers backend ( #14177 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-05 15:06:28 +00:00
Robert Shaw
257e200a25
[V1][Frontend] Add Testing For V1 Runtime Parameters ( #14159 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-03-05 14:18:55 +00:00
Zhe Zhang
47d4a7e004
Small update for external_launcher backend docs ( #14288 )
2025-03-05 21:30:00 +08:00
Cyrus Leung
7f89a594dd
[Doc] [3/N] Refer code examples for common cases in dev multimodal processor ( #14278 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-05 12:29:50 +00:00
Iacopo Poli
961644e6a8
[Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID ( #14217 )
...
Signed-off-by: Iacopo Poli <iacopo@lighton.ai >
2025-03-05 11:44:10 +00:00
Lu Fang
8d6cd32b7b
[Bugfix][V1] Fix allowed_token_ids for v1 Sampler ( #14169 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-05 08:49:44 +00:00
Roger Wang
ec79b67c77
[Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing ( #14256 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-05 07:37:16 +00:00
Benjamin Chislett
32985bed7c
[Frontend] Allow return_tokens_as_token_ids to be passed as a request param ( #14066 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-03-05 06:30:40 +00:00
Michael Goin
dae9ec464c
Temporarily disable test_awq_gemm_opcheck ( #14251 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-05 06:10:35 +00:00
youkaichao
6eaf93020d
[platforms] improve rocm debugging info ( #14257 )
2025-03-04 21:32:18 -08:00
Tyler Michael Smith
72c62eae5f
[V1] EP/TP MoE + DP Attention ( #13931 )
2025-03-04 21:27:26 -08:00
Congcong Chen
0a995d5434
[Model] New model support for Phi-4-multimodal-instruct ( #14119 )
2025-03-04 20:57:01 -08:00
Cody Yu
ade3f7d988
[V1][Bugfix] Do not reset prefix caching metrics ( #14235 )
2025-03-05 04:39:13 +00:00
rainkert
0df25101d6
[Bugfix] Fix gptq_marlin for deepseek-v3 ( #13750 )
...
Signed-off-by: dangshunya <dangshunya@baichuan-inc.com >
Co-authored-by: dangshunya <dangshunya@baichuan-inc.com >
2025-03-05 12:25:53 +08:00
Michael Goin
e123aafdf0
Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 ( #14157 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-05 12:25:24 +08:00
Nishidha
5b143d33be
Moved numba from common requirements to cuda/rocm specific requirements ( #14199 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-03-05 12:25:00 +08:00
youkaichao
eb59b5a6cb
[misc] announce china meetup ( #14248 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-05 10:33:50 +08:00
Michael Goin
fbfc3ee37e
[V1][TPU] TPU multimodal model support for ragged attention ( #14158 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-03-04 19:58:48 -05:00
Sage Moore
3e1d223626
[ROCm] Disable a few more kernel tests that are broken on ROCm ( #14145 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-04 23:37:55 +00:00
Tyler Michael Smith
4f5b059f14
Clean up unused padding_idx variables across many model definitions ( #13240 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-04 21:27:00 +00:00
Kuntai Du
288ca110f6
[Security] Serialize using safetensors instead of pickle in Mooncake Pipe ( #14228 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-03-04 21:10:32 +00:00
Mark McLoughlin
c2bd2196fc
[v1][Metrics] Add design doc ( #12745 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-04 20:36:55 +00:00
Michael Goin
550c7ba3dc
[Docs] Update Dockerfile dependency image ( #14215 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-04 20:22:11 +00:00
Harry Mellor
e5b2f1601a
[Frontend] Do prompt_logprobs clamping for chat as well as completions ( #14225 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-04 20:13:06 +00:00
Harry Mellor
9badee53de
Fix performance when --generation-config is not None ( #14223 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-04 20:59:22 +01:00
Siyuan Liu
beebf4742a
[TPU][Profiler] Support start_profile/stop_profile in TPU worker ( #13988 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-03-04 14:40:06 -05:00
kushanam
f89978ad7c
add cutlass support for blackwell fp8 gemm ( #13798 )
2025-03-04 07:55:07 -08:00
lkchen
b3cf368d79
[V1][Molmo] Fix get_multimodal_embeddings() in molmo.py ( #14161 )
2025-03-04 15:43:59 +00:00
Mark McLoughlin
c8525f06fc
[V0][Metrics] Deprecate some questionable request time metrics ( #14135 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-04 15:11:33 +00:00
Nick Hill
5db6b2c961
[V1][BugFix] Fix remaining sync engine client shutdown errors/hangs ( #13869 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-04 15:06:47 +00:00
Michael Goin
6247bae6c6
[Bugfix] Restrict MacOS CPU detection ( #14210 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-04 22:25:27 +08:00
youkaichao
3610fb4930
[doc] add "Failed to infer device type" to faq ( #14200 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-04 20:47:06 +08:00
youkaichao
71c4b40562
[sleep mode] error out with expandable_segments ( #14189 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-04 18:54:19 +08:00
youkaichao
ac65bc92df
[platform] add debug logging during inferring the device type ( #14195 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-04 18:39:16 +08:00
Michael Goin
f78c0be80a
Fix benchmark_moe.py tuning for CUDA devices ( #14164 )
2025-03-03 21:11:03 -08:00
Zhanwen Chen
66233af7b6
Use math.prod instead of np.prod for trivial ops ( #14142 )
2025-03-03 21:09:22 -08:00
Rui Qiao
bf13d40972
[core] Pass all driver env vars to ray workers unless excluded ( #14099 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-04 11:44:17 +08:00
Cody Yu
989f4f430c
[Misc] Remove lru_cache in NvmlCudaPlatform ( #14156 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-04 11:09:34 +08:00
Divakar Verma
bb5b640359
[core] moe fp8 block quant tuning support ( #14068 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-03-04 01:30:23 +00:00
Travis Johnson
c060b71408
[Model] Add support for GraniteMoeShared models ( #13313 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-04 08:04:52 +08:00
iefgnoix
79e4937c65
[v1] Add comments to the new ragged paged attention Pallas kernel ( #14155 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-03-03 23:00:55 +00:00
Qubitium-ModelCloud
cd1d3c3df8
[Docs] Add GPTQModel ( #14056 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-03-03 21:59:09 +00:00
Michael Goin
19d98e0c7d
[Kernel] Optimize moe intermediate_cache usage ( #13625 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-03 16:29:53 -05:00
Michael Goin
2b04c209ee
[Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 ( #14100 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-03 14:20:24 -07:00
Mark McLoughlin
ae122b1cbd
[WIP][[V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics ( #14055 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 19:04:45 +00:00
Nick Hill
872db2be0e
[V1] Simplify stats logging ( #14082 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-03 10:34:14 -08:00
Mark McLoughlin
2dfdfed8a0
[V0][Metrics] Deprecate some KV/prefix cache metrics ( #14136 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 18:25:46 +00:00
Mark McLoughlin
c41d27156b
[V0][Metrics] Remove unimplemented vllm:tokens_total ( #14134 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 17:50:22 +00:00
Harry Mellor
91373a0d15
Fix head_dim not existing in all model configs (Transformers backend) ( #14141 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-03 17:48:11 +00:00
TJian
848a6438ae
[ROCm] Faster Custom Paged Attention kernels ( #12348 )
2025-03-03 09:24:45 -08:00
Harry Mellor
98175b2816
Improve the docs for TransformersModel ( #14147 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-03 17:03:05 +00:00
Mark McLoughlin
4167252eaf
[V1] Refactor parallel sampling support ( #13774 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 08:15:27 -08:00
Cody Yu
f35f8e2242
[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 ( #13921 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-03 16:43:14 +08:00
Mengqing Cao
b87c21fc89
[Misc][Platform] Move use allgather to platform ( #14010 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-03-03 15:40:04 +08:00
wang.yuqi
e584b85afd
[Misc] duplicate code in deepseek_v2 ( #14106 )
2025-03-03 14:10:11 +08:00
Sheng Yao
09e56f9262
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure ( #14051 )
2025-03-02 17:35:01 -08:00
Harry Mellor
cf069aa8aa
Update deprecated Python 3.8 typing ( #13971 )
2025-03-02 17:34:51 -08:00
Ce Gao
bf33700ecd
[v0][structured output] Support reasoning output ( #12955 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-02 14:49:42 -05:00
qux-bbb
bc6ccb9878
[Doc] Source building add clone step ( #14086 )
...
Signed-off-by: qux-bbb <1147635419@qq.com >
2025-03-02 10:59:50 +00:00
Jun Duan
82fbeae92b
[Misc] Accurately capture the time of loading weights ( #14063 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-03-01 17:20:30 -08:00
Jee Jee Li
cc5e8f6db8
[Model] Add LoRA support for TransformersModel ( #13770 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-02 09:17:34 +08:00
Chen Zhang
d54990da47
[v1] Add __repr__ to KVCacheBlock to avoid recursive print ( #14081 )
2025-03-01 20:46:02 +00:00
Chen Zhang
b9f1d4294e
[v1][Bugfix] Only cache blocks that are not in the prefix cache ( #14073 )
2025-03-01 08:25:54 +00:00
Sage Moore
b28246f6ff
[ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class ( #14065 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-01 07:18:32 +00:00
Woosuk Kwon
3b5567a209
[V1][Minor] Do not print attn backend twice ( #13985 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-01 07:09:14 +00:00
Isotr0py
fdcc405346
[Doc] Consolidate whisper and florence2 examples ( #14050 )
2025-02-28 22:49:15 -08:00
Kuntai Du
8994dabc22
[Documentation] Add more deployment guide for Kubernetes deployment ( #13841 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-03-01 06:44:24 +00:00
Li, Jiang
02296f420d
[Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor ( #14053 )
2025-02-28 22:31:01 -08:00
YajieWang
6a92ff93e1
[Misc][Kernel]: Add GPTQAllSpark Quantization ( #12931 )
2025-02-28 22:30:59 -08:00
Jee Jee Li
6a84164add
[Bugfix] Add file lock for ModelScope download ( #14060 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-01 06:10:28 +00:00
Brayden Zhong
f64ffa8c25
[Docs] Add pipeline_parallel_size to optimization docs ( #14059 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-03-01 05:43:54 +00:00
Luka Govedič
bd56c983d6
[torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass ( #10902 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-02-28 16:20:11 -07:00
Rui Qiao
084bbac8cc
[core] Bump ray to 2.43 ( #13994 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-02-28 21:47:44 +00:00
Chen Zhang
28943d36ce
[v1] Move block pool operations to a separate class ( #13973 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-28 20:53:31 +00:00
Andrey Talman
b526ca6726
Add RELEASE.md ( #13926 )
...
Signed-off-by: atalman <atalman@fb.com >
2025-02-28 12:25:50 -08:00
Chen Zhang
e7bd944e08
[v1] Cleanup the BlockTable in InputBatch ( #13977 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-02-28 19:03:16 +00:00
iefgnoix
c3b6559a10
[V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU ( #13379 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-02-28 11:01:36 -07:00
Harry Mellor
4be4b26cb7
Fix entrypoint tests for embedding models ( #14052 )
2025-02-28 08:56:44 -08:00
Brayden Zhong
2aed2c9fa7
[Doc] Fix ROCm documentation ( #14041 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-02-28 16:42:07 +00:00
Yang Liu
9b61dd41e7
[Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series ( #14031 )
2025-02-28 07:36:08 -08:00
Cyrus Leung
f7bee5c815
[VLM][Bugfix] Enable specifying prompt target via index ( #14038 )
2025-02-28 07:35:55 -08:00
Jee Jee Li
e0734387fb
[Bugfix] Fix MoeWNA16Method activation ( #14024 )
2025-02-28 15:22:42 +00:00
Harry Mellor
f58f8b5c96
Update AutoAWQ docs ( #14042 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-28 15:20:29 +00:00
Thibault Schueller
b3f7aaccd0
[V1][Minor] Restore V1 compatibility with LLMEngine class ( #13090 )
2025-02-28 00:52:25 -08:00
Kacper Pietkun
b91660ddb8
[Hardware][Intel-Gaudi] Regional compilation support ( #13213 )
2025-02-28 00:51:49 -08:00
Harry Mellor
76c89fcadd
Use smaller embedding model when not testing model specifically ( #13891 )
2025-02-28 00:50:43 -08:00
Mathis Felardos
b9e41734c5
[Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) ( #13987 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2025-02-28 07:53:45 +00:00
Cyrus Leung
1088f06242
[Doc] Move multimodal Embedding API example to Online Serving page ( #14017 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-28 07:12:04 +00:00
Travis Johnson
73e0225ee9
[Bugfix] Check that number of images matches number of <|image|> tokens with mllama ( #13911 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-02-28 04:00:45 +00:00
Roger Wang
6c85da3a18
[V1]SupportsV0Only protocol for model definitions ( #13959 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-27 20:02:15 -05:00
Jee Jee Li
67fc426845
[Misc] Print FusedMoE detail info ( #13974 )
2025-02-27 18:53:13 -05:00
Benjamin Chislett
9804145cac
[Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict ( #13626 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-02-27 15:28:08 -08:00
Lucas Wilkinson
2e94b9cfbb
[Attention] Flash MLA for V1 ( #13867 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Yang Chen <yangche@fb.com >
2025-02-27 23:03:41 +00:00
qli88
8294773e48
[core] Perf improvement for DSv3 on AMD GPUs ( #13718 )
...
Signed-off-by: qli88 <qiang.li2@amd.com >
2025-02-27 22:14:30 +00:00
Woosuk Kwon
cd813c6d4d
[V1][Minor] Minor cleanup for GPU Model Runner ( #13983 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-27 13:11:40 -08:00
Sage Moore
38acae6e97
[ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups ( #13970 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-02-27 20:31:47 +00:00
Cyrus Leung
a2dd48c386
[VLM] Deprecate legacy input mapper for OOT multimodal models ( #13979 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-27 19:14:55 +00:00
dependabot[bot]
126f6beeb4
Bump azure/setup-helm from 4.2.0 to 4.3.0 ( #13742 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 19:04:10 +00:00
Yang Chen
58d1b2aa77
[Attention] MLA support for V1 ( #13789 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-27 13:14:17 -05:00
Cyrus Leung
f1579b229d
[VLM] Generalized prompt updates for multi-modal processor ( #13964 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-27 17:44:25 +00:00
Isotr0py
7864875879
[Bugfix] Fix qwen2.5-vl overflow issue ( #13968 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-27 17:30:39 +00:00
Noam Gat
1dd422b64a
Update LMFE version to v0.10.11 to support new versions of transforme… ( #13930 )
2025-02-27 17:16:12 +00:00
Rui Qiao
06c8f8d885
[bugfix] Fix profiling for RayDistributedExecutor ( #13945 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-02-28 01:01:21 +08:00
Harry Mellor
5677c9bb3e
Deduplicate .pre-commit-config.yaml's exclude ( #13967 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-27 16:27:47 +00:00
王博伟
512d77d582
Update quickstart.md ( #13958 )
2025-02-27 16:05:11 +00:00
Szymon Ożóg
7f0be2aa24
[Model] Deepseek GGUF support ( #13167 )
2025-02-27 02:08:35 -08:00
Isotr0py
edf309ebbe
[VLM] Support multimodal inputs for Florence-2 models ( #13320 )
2025-02-27 02:06:41 -08:00
Michael Goin
788f284b53
Fix test_block_fp8.py test for MoE ( #13915 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-27 18:00:00 +08:00
Yang Zheng
4b1d141f49
[PP] Correct cache size check ( #13873 )
...
Signed-off-by: Yang Zheng <zhengy.gator@gmail.com >
2025-02-27 17:47:29 +08:00
Chauncey
10c3b8c1cf
[Misc] fixed 'required' is an invalid argument for positionals ( #13948 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-02-27 09:06:49 +00:00
Brayden Zhong
a7f37314b7
[CI/Build] Add examples/ directory to be labelled by mergify ( #13944 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-02-27 08:24:11 +00:00
Mark McLoughlin
cd711c48b2
[V1][Metrics] Handle preemptions ( #13169 )
2025-02-26 20:04:59 -08:00
Sage Moore
378b3ef6f8
[ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding ( #13922 )
2025-02-26 20:04:12 -08:00
Rui Qiao
c9944acbf9
[misc] Rename Ray ADAG to Compiled Graph ( #13928 )
2025-02-26 20:03:28 -08:00
Michael Goin
ca377cf1b9
Use CUDA 12.4 as default for release and nightly wheels ( #12098 )
2025-02-26 19:06:37 -08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
a31614e386
[ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined ( #13851 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-02-27 10:39:10 +08:00
Lucas Wilkinson
f95903909f
[Kernel] FlashMLA integration ( #13747 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-02-27 10:35:08 +08:00
Woosuk Kwon
b382a7f28f
[BugFix] Make FP8 Linear compatible with torch.compile ( #13918 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-26 13:48:55 -08:00
Wallas Henrique
4cb6fa0a9c
[Bugfix] Backend option to disable xgrammar any_whitespace ( #12744 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
2025-02-26 10:52:34 -08:00
Chauncey
d08b285adf
[Misc] fixed qwen_vl_utils parameter error ( #13906 )
2025-02-26 08:31:53 -08:00
Chenyaaang
b27122acc2
[TPU] use torch2.6 with whl package ( #13860 )
...
Signed-off-by: Chenyaaang <llccyy1212@gmail.com >
2025-02-26 08:18:54 -05:00
Cyrus Leung
934bb99c71
[Bugfix] Update expected token counts for Ultravox tests ( #13895 )
2025-02-26 04:56:50 -08:00
Joe Runde
3f808cc044
[Bugfix] Do not crash V0 engine on input errors ( #13101 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-02-26 19:07:29 +08:00
Brayden Zhong
ec8a5e5386
[Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor ( #13736 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-02-26 19:06:47 +08:00
Florian Greinacher
215bf150a6
[Bugfix] Handle None parameters in Mistral function calls. ( #13786 )
2025-02-26 03:06:21 -08:00
Harry Mellor
0ecdd98031
Add comments on accessing kv_cache and attn_metadata ( #13887 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-26 18:41:02 +08:00
Cyrus Leung
7b700ec8c8
[Bugfix] Add test example for Ultravox v0.5 ( #13890 )
2025-02-26 02:31:43 -08:00
Roger Wang
7ca1da020f
[Misc] Fix input processing for Ultravox ( #13871 )
2025-02-25 23:56:34 -08:00
Jee Jee Li
5157338ed9
[Misc] Improve LoRA spelling ( #13831 )
2025-02-25 23:43:01 -08:00
Seth Kimmel
e206b54331
[v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine ( #13837 )
...
Signed-off-by: Seth Kimmel <seth.kimmel3@gmail.com >
2025-02-26 14:58:24 +08:00
Sage Moore
1d35662e6d
[ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms ( #13844 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-02-26 14:56:58 +08:00
Albert
e656f638de
[Doc] fix the incorrect module path of tensorize_vllm_model ( #13863 )
2025-02-25 22:56:19 -08:00
Harry Mellor
145944cb94
Improve pipeline partitioning ( #13839 )
2025-02-25 18:53:56 -08:00
Henry Tsang
094b7d9496
[Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues ( #13797 )
2025-02-25 18:52:03 -08:00
Chenguang Li
e1fe7591f2
[Misc]Code Cleanup ( #13859 )
...
Signed-off-by: noemotiovon <noemotiovon@gmail.com >
Co-authored-by: noemotiovon <noemotiovon@gmail.com >
2025-02-26 10:44:30 +08:00
Lily Liu
5629f26df7
[V1][Spec Decode] Change Spec Decode Rejection Sampling API ( #13729 )
2025-02-25 18:14:48 -08:00
Rui Qiao
9ba28043b5
[misc] Show driver IP info when Ray fails to allocate driver worker ( #13858 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-02-26 09:53:43 +08:00
Harry Mellor
24679788ed
DeepSeek V2/V3/R1 only place lm_head on last pp rank ( #13833 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-26 01:24:57 +00:00
Michael Goin
07c4353057
[Model] Support Grok1 ( #13795 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-26 01:07:12 +00:00
Harry Mellor
34e3494e70
Fix failing MyGemma2Embedding test ( #13820 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-25 12:33:03 -08:00
Liangfu Chen
f75aa72732
[Neuron] Add custom_ops for neuron backend ( #13246 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: George Novack <gnovack@amazon.com >
Co-authored-by: Aoyu Zhang <aoyuzhan@amazon.com >
2025-02-25 11:47:49 -08:00
Chen1022
340e39e387
Fix string parsing error ( #13825 )
2025-02-25 08:20:29 -08:00
Cyrus Leung
f4133ce4e5
[Bugfix] Revert inspection code in #13743 ( #13832 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-26 00:18:50 +08:00
Wen Sun
6522d55b6f
Fix /v1/audio/transcriptions Bad Request Error ( #13811 )
2025-02-25 06:03:33 -08:00
Isotr0py
6ff518626c
[Bugfix] Fix deepseek-vl2 inference with more than 2 images ( #13818 )
2025-02-25 06:03:02 -08:00
Nichols A. Romero
fa82074167
[Bugfix] Flush TunableOp results before worker processes are destroyed. ( #13623 )
...
Signed-off-by: Nichols A. Romero <nick.romero@amd.com >
2025-02-25 11:08:20 +00:00
Junlin Zhou
75e9d49796
[Bugfix] Initialize attention bias on the same device as Query/Key/Value ( #13468 )
2025-02-25 02:13:09 -08:00
Chen1022
32c3b6bfd1
[Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo IDs ( #13724 )
...
Signed-off-by: Chen-0210 <chenjincong11@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-02-25 10:12:19 +00:00
Jee Jee Li
37b6cb4985
[CI/Build] Fix V1 LoRA failure ( #13767 )
2025-02-25 02:01:15 -08:00
Gregory Shtrasberg
aabeb2688f
[ROCm][Quantization][Kernel] Using HIP FP8 header ( #12593 )
2025-02-25 00:39:59 -08:00
Jiayi Yao
2f42a4888c
[Feature] Support KV cache offloading and disagg prefill with LMCache connector. ( #12953 )
2025-02-25 00:38:42 -08:00
Rui Qiao
3173c3b34e
[misc] Clean up ray compiled graph type hints ( #13731 )
2025-02-25 00:37:08 -08:00
Shanshan Shen
2d87d7d1ac
[Bugfix] Modify modelscope api usage in transformer_utils ( #13807 )
2025-02-25 00:36:07 -08:00
Russell Bryant
aab392774b
[Core] xgrammar: Expand list of unsupported jsonschema keywords ( #13783 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-25 08:21:25 +00:00
Cyrus Leung
6724e79164
[Misc] Check that the model can be inspected upon registration ( #13743 )
2025-02-25 00:18:19 -08:00
Varun Sundar Rabindranath
03f48b3db6
[Core] LoRA V1 - Add add/pin/list/remove_lora functions ( #13705 )
2025-02-25 00:18:02 -08:00
Michael Goin
4d251ad00e
Fix CompressedTensorsWNA16MoE with grouped scales ( #13769 )
2025-02-25 00:17:14 -08:00
Michael Goin
18e505930d
[Bugfix] Support MLA for CompressedTensorsWNA16 ( #13725 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-25 06:10:31 +00:00
Lucas Wilkinson
4a8cfc7551
[Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" ( #13802 )
2025-02-24 20:33:59 -08:00
Mark McLoughlin
bc32bc73aa
[V1][Metrics] Implement vllm:lora_requests_info metric ( #13504 )
2025-02-24 20:01:33 -08:00
wangxiyuan
ab1091d5f2
[Misc][Attention][Quantization] init property earlier ( #13733 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-02-25 03:19:30 +00:00
Tyler Michael Smith
1e15aaef56
[Bugfix][Quantization] Fix FP8 + EP ( #13784 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-25 10:54:17 +08:00
cjackal
51010a1807
[Misc] set single whitespace between log sentences ( #13771 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-02-25 10:26:12 +08:00
Eli Boyarski
7196a3b1db
[Doc] arg_utils.py: fixed a typo ( #13785 )
2025-02-24 18:23:04 -08:00
Harry Mellor
cdc1fa12eb
Remove unused kwargs from model definitions ( #13555 )
2025-02-24 17:13:52 -08:00
Robert Shaw
f61528d46d
[Misc][Chore] Clean Up AsyncOutputProcessing Logs ( #13780 )
2025-02-24 16:39:07 -08:00
Robert Shaw
1f0ae3ed0a
[Misc] Clean Up EngineArgs.create_engine_config ( #13734 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-02-24 13:52:21 -05:00
Michael Goin
db986c19ea
Fix precommit fail in fused_moe intermediate_cache2 chunking ( #13772 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-24 09:25:47 -08:00
Roger Wang
227578480d
Revert "[V1][Core] Fix memory issue with logits & sampling" ( #13775 )
2025-02-24 09:16:05 -08:00
afeldman-nm
befc402d34
[V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) ( #10980 )
...
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-02-24 08:29:41 -08:00
Nicolò Lucchesi
444b0f0f62
[Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set ( #12513 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-02-24 10:43:21 -05:00
Zhonghua Deng
ccc00515fd
[BugFix] Illegal memory access for MoE On H20 ( #13693 )
2025-02-24 07:37:32 -08:00
Jongseok Park
781096e385
Expert Parallelism (EP) Support for DeepSeek V2 ( #12583 )
2025-02-24 07:33:20 -08:00
Roger Meier
7940d8a6a7
[CI/Build] add python-json-logger to requirements-common ( #12842 )
2025-02-24 06:10:33 -08:00
Roger Meier
c0e3ecd6d2
[Bugfix] fix(logging): add missing opening square bracket ( #13011 )
2025-02-24 06:10:25 -08:00
Mengqing Cao
23eca9cf68
[model][refactor] remove cuda hard code in models and layers ( #13658 )
2025-02-24 06:10:14 -08:00
Roger Wang
437b76ff59
[V1][Core] Fix memory issue with logits & sampling ( #13721 )
2025-02-24 06:10:06 -08:00
Kevin H. Luu
f90a375593
[ci] Add logic to change model to S3 path only when S3 CI env var is on ( #13727 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-63-253.us-west-2.compute.internal >
2025-02-24 06:32:11 +00:00
Huy Do
e7ef74e26e
Fix some issues with benchmark data output ( #13641 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-02-24 10:23:18 +08:00
Nick Hill
cbae7af552
[V1][BugFix] Fix engine core client shutdown hangs ( #13298 )
...
Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, closing them explicitly appears to be necessary; otherwise shutdown can hang in the context.term() method.
Close ZMQ sockets explicitly before terminating the context, make shutdown of client resources more robust, and shut down the engine core process before terminating the ZMQ context.
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-23 13:07:43 -08:00
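The shutdown order described in the #13298 entry above (close every socket explicitly, then terminate the context) can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual shutdown code: the helper name `close_sockets_then_term`, the placeholder address, and the choice of `linger=0` are assumptions made for the example. Only `Socket.close(linger=...)`, `Socket.closed`, `Context.term()` and `Context.closed` from pyzmq are relied on.

```python
try:
    import zmq  # pyzmq; only needed for the demo at the bottom
except ImportError:
    zmq = None


def close_sockets_then_term(ctx, sockets, linger=0):
    """Close each socket explicitly, then terminate the context.

    Hypothetical helper for illustration; not vLLM's actual shutdown code.
    """
    for sock in sockets:
        if not sock.closed:
            # linger=0 discards unsent messages so term() is not left
            # waiting on them.
            sock.close(linger=linger)
    # With every socket already closed, term() has nothing left to wait on.
    ctx.term()


if __name__ == "__main__" and zmq is not None:
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    # Placeholder address (assumption): nothing is listening here.
    push.connect("tcp://127.0.0.1:5555")
    push.send(b"queued, never delivered")
    close_sockets_then_term(ctx, [push])
    print("context terminated:", ctx.closed)
```

Keeping the helper duck-typed (it only needs `closed`, `close(linger=...)` and `term()`) makes the ordering easy to exercise without a live ZMQ context. Whether `linger=0` is acceptable depends on whether dropping queued messages at shutdown is tolerable for the caller.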
youkaichao
eb24dc4a45
[v1] torchrun compatibility ( #13642 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-23 22:47:24 +08:00
Roger Wang
9bebc9512f
[Misc] Deprecate --dataset from benchmark_serving.py ( #13708 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-23 13:32:20 +00:00
Nick Hill
5a2ba16f5c
[Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms ( #13688 )
2025-02-23 02:54:29 -08:00
Isotr0py
ba5106e519
[LMM] Implement merged multimodal processor for whisper ( #13278 )
2025-02-23 01:46:03 -08:00
Kyle Sayers
d5ca2110f1
[Quant] BaiChuan SupportsQuant ( #13710 )
2025-02-22 19:21:15 -08:00
Kevin H. Luu
2c5e637b57
[ci] Use env var to control whether to use S3 bucket in CI ( #13634 )
2025-02-22 19:19:45 -08:00
Andy Lo
322d2a27d6
[BugFix] Minor: logger import in attention backend ( #13706 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-02-22 16:51:13 -08:00
Roger Wang
82e0d601fc
[CI/Build] Fix pre-commit errors from #13571 ( #13709 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-22 16:50:38 -08:00
Daniele
78ac0f591d
[CI/Build] fix uv caching in Dockerfile ( #13611 )
2025-02-22 08:25:20 -08:00
Yan Ma
b56155e7f3
[XPU]fix setuptools version for xpu ( #13548 )
2025-02-22 08:05:35 -08:00
Helena Kloosterman
382f66fb08
[Bugfix] Fix boolean conversion for OpenVINO env variable ( #13615 )
2025-02-22 08:04:12 -08:00
Cyrus Leung
8354f6640c
[Doc] Dockerfile instructions for optional dependencies and dev transformers ( #13699 )
2025-02-22 06:04:31 -08:00
Gregory Shtrasberg
c904fdddf6
[ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm ( #13231 )
2025-02-22 05:54:38 -08:00
Sage Moore
558db8083c
[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths ( #13095 )
2025-02-22 05:25:41 -08:00
Kaixi Hou
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm ( #13571 )
2025-02-22 05:24:05 -08:00
Keyun Tong
8db1b9d0a1
Support SSL Key Rotation in HTTP Server ( #13495 )
2025-02-22 05:17:44 -08:00
youkaichao
2382ad29d1
[ci] fix linter ( #13701 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-22 20:28:59 +08:00
youkaichao
3e472d882a
[core] set up data parallel communication ( #13591 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-22 19:28:59 +08:00
Cyrus Leung
7f6bae561c
[CI/Build] Fix pre-commit errors ( #13696 )
2025-02-22 00:31:26 -08:00
Jee Jee Li
105b8ce4c0
[Misc] Reduce LoRA-related static variable ( #13166 )
2025-02-22 00:21:30 -08:00
Mark McLoughlin
2cb8c1540e
[Metrics] Add --show-hidden-metrics-for-version CLI arg ( #13295 )
2025-02-22 00:20:45 -08:00
Mark McLoughlin
1cd981da4f
[V1][Metrics] Support vllm:cache_config_info ( #13299 )
2025-02-22 00:20:00 -08:00
Yu Chin Fabian Lim
fca20841c2
Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size ( #13660 )
2025-02-22 00:19:10 -08:00
Jennifer Zhao
da31b5333e
[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler ( #13594 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-02-22 00:08:29 -08:00
Lu Fang
bb78fb318e
[v1] Support allowed_token_ids in v1 Sampler ( #13210 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-22 14:13:05 +08:00
Robin
8aca27fa11
[Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len ( #13691 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-02-22 14:10:38 +08:00
Dipika Sikka
95c617e04b
[Misc] Bump compressed-tensors ( #13619 )
2025-02-21 22:09:04 -08:00
Shane A
9a1f1da5d1
[Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA ( #13687 )
2025-02-21 22:07:45 -08:00
Gordon Wong
68d630a0c7
[ROCM] fix native attention function call ( #13650 )
2025-02-21 22:07:04 -08:00
Jun Duan
68d535ef44
[Misc] Capture and log the time of loading weights ( #13666 )
2025-02-21 22:06:34 -08:00
Robin
c6ed93860f
[Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… ( #13672 )
2025-02-21 22:05:28 -08:00
Keyun Tong
0ffdf8ce0c
[HTTP Server] Make model param optional in request ( #13568 )
2025-02-21 21:55:50 -08:00
Yuan Tang
8c0dd3d4df
docs: Add a note on full CI run in contributing guide ( #13646 )
2025-02-21 21:53:59 -08:00
Isotr0py
ada7c780d5
[Misc] Fix yapf linting tools etc not running on pre-commit ( #13695 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-22 13:10:43 +08:00
Lucas Wilkinson
288cc6c234
[Attention] MLA with chunked prefill ( #12639 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Patrick Horn <patrick.horn@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-21 15:30:12 -08:00
John Zheng
900edbfa48
fix typo of grafana dashboard, with correct datasource ( #13668 )
...
Signed-off-by: John Zheng <john.zheng@hp.com >
2025-02-21 18:21:05 +00:00
Isotr0py
b2c3fc5d65
[Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation ( #13586 )
2025-02-20 22:24:17 -08:00
leoneo
839b27c6cc
[Kernel]Add streamK for block-quantized CUTLASS kernels ( #12978 )
2025-02-20 22:14:24 -08:00
Kevin H. Luu
34ad27fe83
[ci] Fix metrics test model path ( #13635 )
2025-02-20 22:12:10 -08:00
Gabriel Marinho
1c3c975766
[FEATURE] Enables /score endpoint for embedding models ( #12846 )
2025-02-20 22:09:47 -08:00
Szymon Ożóg
1cdc88614a
Missing comment explaining VDR variable in GGUF kernels ( #13290 )
2025-02-20 22:06:54 -08:00
Nick Hill
31aa045c11
[V1][Sampler] Avoid an operation during temperature application ( #13587 )
2025-02-20 22:05:56 -08:00
Roger Wang
a30c093502
[Bugfix] Add mm_processor_kwargs to chat-related protocols ( #13644 )
2025-02-20 22:04:33 -08:00
Harry Mellor
c7b07a95a6
Use pre-commit to update requirements-test.txt ( #13617 )
2025-02-20 22:03:27 -08:00
Kaixi Hou
27a09dc52c
[NVIDIA] Fix an issue to use current stream for the nvfp4 quant ( #13632 )
2025-02-20 22:01:48 -08:00
Edwin Hernandez
981f3c831e
[Misc] Adding script to setup ray for multi-node vllm deployments ( #12913 )
2025-02-20 21:16:40 -08:00
Kante Yin
44c33f01f3
Add llmaz as another integration ( #13643 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2025-02-21 03:52:40 +00:00
Lingfan Yu
33170081f1
[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth ( #13245 )
...
Signed-off-by: Lingfan Yu <lingfany@amazon.com >
2025-02-20 17:45:45 -08:00
Michael Goin
71face8540
[Bugfix] Fix max_num_batched_tokens for MLA ( #13620 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-20 17:45:20 -08:00
Joe Runde
bfbc0b32c6
[Frontend] Add backend-specific options for guided decoding ( #13505 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-02-20 15:07:58 -05:00
ajayvohra2005
6a417b8600
fix neuron performance issue ( #13589 )
2025-02-20 10:59:36 -08:00
Woosuk Kwon
d3ea50113c
[V1][Minor] Print KV cache size in token counts ( #13596 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-20 09:24:31 -08:00
Harry Mellor
34aad515c8
Update pre-commit's isort version to remove warnings ( #13614 )
2025-02-20 08:00:14 -08:00
chenxiaobing
ed6e9075d3
[Bugfix] Fix deepseekv3 grouped topk error ( #13474 )
...
Signed-off-by: Chen-XiaoBing <chenxb002@whu.edu.cn >
2025-02-20 06:47:01 -08:00
Harry Mellor
992e5c3d34
Merge similar examples in offline_inference into single basic example ( #12737 )
2025-02-20 04:53:51 -08:00
Varun Sundar Rabindranath
b69692a2d8
[Kernel] LoRA - Refactor sgmv kernels ( #13110 )
2025-02-20 07:28:06 -05:00
Kevin H. Luu
a64a84433d
[2/n][ci] S3: Use full model path ( #13564 )
...
Signed-off-by: <>
2025-02-20 01:20:15 -08:00
Kevin H. Luu
aa1e62d0db
[ci] Fix spec decode test ( #13600 )
2025-02-20 16:56:00 +08:00
Michael Goin
497bc83124
[CI/Build] Use uv in the Dockerfile ( #13566 )
2025-02-19 23:05:44 -08:00
Yuan Tang
3738e6fa80
[API Server] Add port number range validation ( #13506 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-20 15:05:13 +08:00
Gregory Shtrasberg
0023cd2b9d
[ROCm] MI300A compile targets deprecation ( #13560 )
2025-02-19 23:05:00 -08:00
燃
041e294716
[Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL ( #13533 )
2025-02-19 23:04:30 -08:00
Alex Brooks
9621667874
[Misc] Warn if the vLLM version can't be retrieved ( #13501 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-02-20 06:24:48 +00:00
Simon Mo
8c755c3b6d
[bugfix] spec decode worker get tp group only when initialized ( #13578 )
2025-02-20 04:46:28 +00:00
youkaichao
ba81163997
[core] add sleep and wake up endpoint and v1 support ( #12987 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: cennn <2523403608@qq.com >
Co-authored-by: cennn <2523403608@qq.com >
2025-02-20 12:41:17 +08:00
Divakar Verma
0d243f2a54
[ROCm][MoE] mi300 mixtral8x7B perf for specific BS ( #13577 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-02-20 04:01:02 +00:00
Kevin H. Luu
88f6ba3281
[ci] Add AWS creds for AMD ( #13572 )
2025-02-20 03:56:06 +00:00
Jee Jee Li
512368e34a
[Misc] Qwen2.5 VL support LoRA ( #13261 )
2025-02-19 18:37:55 -08:00
Kevin H. Luu
473f51cfd9
[3/n][CI] Load Quantization test models with S3 ( #13570 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-20 10:12:30 +08:00
Nick Hill
a4c402a756
[BugFix] Avoid error traceback in logs when V1 LLM terminates ( #13565 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-20 00:49:01 +00:00
Isotr0py
550d97eb58
[Misc] Avoid calling unnecessary hf_list_repo_files for local model path ( #13348 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
2025-02-19 18:57:48 +00:00
Cody Yu
fbbe1fbac6
[MISC] Logging the message about Ray teardown ( #13502 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
2025-02-19 09:40:50 -08:00
Wilson Wu
01c184b8f3
Fix copyright year to auto get current year ( #13561 )
2025-02-19 16:55:34 +00:00
youkaichao
ad5a35c21b
[doc] clarify multi-node serving doc ( #13558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-19 22:32:17 +08:00
shangmingc
5ae9f26a5a
[Bugfix] Fix device ordinal for multi-node spec decode ( #13269 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-19 22:13:15 +08:00
Cyrus Leung
377d10bd14
[VLM][Bugfix] Pass processor kwargs properly on init ( #13516 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-19 13:13:50 +00:00
youkaichao
52ce14d31f
[doc] clarify profiling is only for developers ( #13554 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-19 20:55:58 +08:00
Daniele
81dabf24a8
[CI/Build] force writing version file ( #13544 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-02-19 18:48:03 +08:00
Yannick Schnider
423330263b
[Feature] Pluggable platform-specific scheduler ( #13161 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com >
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2025-02-19 17:16:38 +08:00
Nick Hill
caf7ff4456
[V1][Core] Generic mechanism for handling engine utility ( #13060 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-19 17:09:22 +08:00
Lucia Fang
f525c0be8b
[Model][Speculative Decoding] DeepSeek MTP spec decode ( #12755 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-02-19 17:06:23 +08:00
Alex Brooks
983a40a8bb
[Bugfix] Fix Positive Feature Layers in Llava Models ( #13514 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-02-19 08:50:07 +00:00
Zhe Zhang
fdc5df6f54
use device param in load_model method ( #13037 )
2025-02-19 16:05:02 +08:00
Kevin H. Luu
3b05cd4555
[perf-benchmark] Fix ECR path for premerge benchmark ( #13512 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 07:56:11 +00:00
Kevin H. Luu
d5d214ac7f
[1/n][CI] Load models in CI from S3 instead of HF ( #13205 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 07:34:59 +00:00
Roger Wang
fd84857f64
[Doc] Add clarification note regarding paligemma ( #13511 )
2025-02-18 22:24:03 -08:00
Divakar Verma
8aada19dfc
[ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe ( #13503 )
2025-02-18 22:23:24 -08:00
Kevin H. Luu
9aa95b0e6a
[perf-benchmark] Allow premerge ECR ( #13509 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 05:13:41 +00:00
Yu-Zhou
d0a7a2769d
[Hardware][Gaudi][Feature] Support Contiguous Cache Fetch ( #12139 )
...
Signed-off-by: yuzhou <yuzhou@habana.ai >
Signed-off-by: zhouyu5 <yu.zhou@intel.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-18 19:40:19 -08:00
Harry Mellor
00b69c2d27
[Misc] Remove dangling references to --use-v2-block-manager ( #13492 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-19 03:37:26 +00:00
Woosuk Kwon
4c82229898
[V1][Spec Decode] Optimize N-gram matching with Numba ( #13365 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 13:19:58 -08:00
Woosuk Kwon
c8d70e2437
Pin Ray version to 2.40.0 ( #13490 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 12:50:31 -08:00
Nick Hill
30172b4947
[V1] Optimize handling of sampling metadata and req_ids list ( #13244 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-18 12:15:33 -08:00
Murali Andoorveedu
a4d577b379
[V1][Tests] Adding additional testing for multimodal models to V1 ( #13308 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2025-02-18 09:53:14 -08:00
youkaichao
7b203b7694
[misc] fix debugging code ( #13487 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 09:37:11 -08:00
Woosuk Kwon
4fb8142a0e
[V1][PP] Enable true PP with Ray executor ( #13472 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 09:15:32 -08:00
Daniele
a02c86b4dd
[CI/Build] migrate static project metadata from setup.py to pyproject.toml ( #8772 )
2025-02-18 08:02:49 -08:00
Liangfu Chen
3809458456
[Bugfix] Fix invalid rotary embedding unit test ( #13431 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-02-18 11:52:03 +00:00
zifeitong
d3231cb436
[Bugfix] Handle content type with optional parameters ( #13383 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2025-02-18 11:29:13 +00:00
Cyrus Leung
435b502a6e
[ROCm] Make amdsmi import optional for other platforms ( #13460 )
2025-02-18 03:15:56 -08:00
Isotr0py
29fc5772c4
[Bugfix] Remove noisy error logging during local model loading ( #13458 )
2025-02-18 03:15:48 -08:00
Harry Mellor
2358ca527b
[Doc]: Improve feature tables ( #13224 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-18 18:52:39 +08:00
Isotr0py
8cf97f8661
[Bugfix] Fix failing transformers dynamic module resolving with spawn multiproc method ( #13403 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-18 10:25:53 +00:00
Yuan Tang
e2603fefb8
[Bugfix] Ensure LoRA path from the request can be included in err msg ( #13450 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-18 16:19:15 +08:00
Michael Goin
b53d79983c
Add outlines fallback when JSON schema has enum ( #13449 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-18 06:49:41 +00:00
Woosuk Kwon
9915912f7f
[V1][PP] Fix & Pin Ray version in requirements-cuda.txt ( #13436 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 21:58:06 -08:00
Kyle Sayers
d1b649f1ef
[Quant] Aria SupportsQuant ( #13416 )
2025-02-17 21:51:09 -08:00
youkaichao
ac19b519ed
[core] fix sleep mode in pytorch 2.6 ( #13456 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 13:48:10 +08:00
Yuan Tang
a1074b3efe
[Bugfix] Only print out chat template when supplied ( #13444 )
2025-02-17 21:43:31 -08:00
Kyle Sayers
00294e1bc6
[Quant] Arctic SupportsQuant ( #13366 )
2025-02-17 21:35:09 -08:00
Kyle Sayers
88787bce1d
[Quant] Molmo SupportsQuant ( #13336 )
2025-02-17 21:34:47 -08:00
youkaichao
932b51cedd
[v1] fix parallel config rank ( #13445 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 12:33:45 +08:00
Divakar Verma
7c7adf81fc
[ROCm] fix get_device_name for rocm ( #13438 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-02-18 04:07:12 +00:00
Isotr0py
67ef8f666a
[Model] Enable quantization support for transformers backend ( #12960 )
2025-02-17 19:52:47 -08:00
Harry Mellor
efbe854448
[Misc] Remove dangling references to SamplingType.BEAM ( #13402 )
2025-02-17 19:52:35 -08:00
Tyler Michael Smith
b3942e157e
[Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallback issue ( #13425 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-18 00:32:48 +00:00
Woosuk Kwon
cd4a72a28d
[V1][Spec decode] Move drafter to model runner ( #13363 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 15:40:12 -08:00
Cody Yu
6ac485a953
[V1][PP] Fix intermediate tensor values ( #13417 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-17 13:37:45 -08:00
Woosuk Kwon
4c21ce9eba
[V1] Get input tokens from scheduler ( #13339 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 11:01:07 -08:00
r.4ntix
ce77eb9410
[Bugfix] Fix VLLM_USE_MODELSCOPE issue ( #13384 )
2025-02-17 14:22:01 +00:00
Yan Ma
30513d1cb6
[Bugfix] fix xpu communicator ( #13368 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-02-17 20:59:18 +08:00
Tyler Michael Smith
1f69c4a892
[Model] Support Mamba2 (Codestral Mamba) ( #9292 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-02-17 20:17:50 +08:00
Cyrus Leung
7b623fca0b
[VLM] Check required fields before initializing field config in DictEmbeddingItems ( #13380 )
2025-02-17 01:36:07 -08:00
Mengqing Cao
238dfc8ac3
[MISC] tiny fixes ( #13378 )
2025-02-17 00:57:13 -08:00
Huy Do
45186834a0
Run v1 benchmark and integrate with PyTorch OSS benchmark database ( #13068 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-02-17 08:16:32 +00:00
yankooo
f857311d13
Fix spelling error in index.md ( #13369 )
2025-02-17 06:53:20 +00:00
shangmingc
46cdd59577
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode ( #12304 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-16 19:32:26 -08:00
Jee Jee Li
2010f04c17
[V1][Misc] Avoid unnecessary log output ( #13289 )
2025-02-16 19:26:24 -08:00
Woosuk Kwon
69e1d23e1e
[V1][BugFix] Clean up rejection sampler & Fix warning msg ( #13362 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 12:25:29 -08:00
Isotr0py
d67cc21b78
[Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case ( #13358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-16 18:55:27 +00:00
Woosuk Kwon
e18227b04a
[V1][PP] Cache Intermediate Tensors ( #13353 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 10:02:27 -08:00
Woosuk Kwon
7b89386553
[V1][BugFix] Add __init__.py to v1/spec_decode/ ( #13359 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 09:39:08 -08:00
凌
da833b0aee
[Docs] Change myenv to vllm. Update python_env_setup.inc.md ( #13325 )
2025-02-16 16:04:21 +00:00
Cyrus Leung
5d2965b7d7
[Bugfix] Fix 2 Node and Spec Decode tests ( #13341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-16 22:20:22 +08:00
youkaichao
a0231b7c25
[platform] add base class for communicators ( #13208 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:14:22 +08:00
youkaichao
124776ebd5
[ci] skip failed tests for flashinfer ( #13352 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:09:15 +08:00
Roger Wang
b7d309860e
[V1] Update doc and examples for H2O-VL ( #13349 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-16 10:35:54 +00:00
wchen61
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multi-devices ( #13187 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-02-16 08:59:49 +00:00
Michael Goin
d3d547e057
[Bugfix] Pin xgrammar to 0.1.11 ( #13338 )
2025-02-15 19:42:25 -08:00
Kyle Sayers
12913d17ba
[Quant] Add SupportsQuant to phi3 and clip ( #13104 )
2025-02-15 19:28:33 -08:00
Lily Liu
80f63a3966
[V1][Spec Decode] Ngram Spec Decode ( #12193 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-02-15 18:05:11 -08:00
Cyrus Leung
367cb8ce8c
[Doc] [2/N] Add Fuyu E2E example for multimodal processor ( #13331 )
2025-02-15 07:06:23 -08:00
youkaichao
54ed913f34
[ci/build] update flashinfer ( #13323 )
2025-02-15 05:33:13 -08:00
Cody Yu
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue ( #13064 )
2025-02-15 03:59:01 -08:00
rasmith
ed0de3e4b8
[AMD] [Model] DeepSeek tunings ( #13199 )
2025-02-15 03:58:09 -08:00
Mark McLoughlin
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 ( #13288 )
2025-02-15 03:56:19 -08:00
Isotr0py
7fdaaf48ef
[Bugfix] Fix qwen2.5-vl image processor ( #13286 )
2025-02-15 03:00:11 -08:00
Xu Song
067fa2255b
[Bugfix]Fix search start_index of stop_checker ( #13280 )
2025-02-14 21:39:42 -08:00
Nick Hill
9076325677
[BugFix] Don't scan entire cache dir when loading model ( #13302 )
2025-02-14 21:33:31 -08:00
Tyler Michael Smith
97a3d6d995
[Bugfix] Massage MLA's usage of flash attn for RoCM ( #13310 )
2025-02-14 21:33:25 -08:00
Nicolò Lucchesi
579d7a63b2
[Bugfix][Docs] Fix offline Whisper ( #13274 )
2025-02-14 21:32:37 -08:00
Sage Moore
c9f9d5b397
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm ( #13235 )
2025-02-14 20:30:42 -08:00
Woosuk Kwon
0c73026844
[V1][PP] Fix memory profiling in PP ( #13315 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 20:17:25 -08:00
Nick Hill
6a854c7a2b
[V1][Sampler] Don't apply temp for greedy-only ( #13311 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-14 18:10:53 -08:00
Woosuk Kwon
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p ( #13316 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 17:29:51 -08:00
Aoyu
a12934d3ec
[V1][Core] min_p sampling support ( #13191 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
2025-02-14 15:50:05 -08:00
Joe Runde
3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills ( #10235 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-14 15:36:07 -08:00
Michael Goin
5e5c8e091e
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts ( #13236 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-14 12:53:42 -08:00
Yu-Zhou
c9e2d644e7
[Hardware][Gaudi][Bugfix] Fix error for guided decoding ( #12317 )
2025-02-14 04:36:49 -08:00
Russell Bryant
7734e9a291
[Core] choice-based structured output with xgrammar ( #12632 )
2025-02-14 04:36:05 -08:00
Lu Fang
6224a9f620
Support logit_bias in v1 Sampler ( #13079 )
2025-02-14 04:34:59 -08:00
Nick Hill
085b7b2d6c
[V1] Simplify GPUModelRunner._update_states check ( #13265 )
2025-02-14 04:33:43 -08:00
Cyrus Leung
4da1f667e9
[VLM] Keep track of whether prompt replacements have been applied ( #13215 )
2025-02-14 04:20:46 -08:00
Jun Duan
556ef7f714
[Misc] Log time consumption of sleep and wake-up ( #13115 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-02-14 20:10:21 +08:00
Xu Song
83481ceb49
[Bugfix] Fix missing parentheses ( #13263 )
2025-02-14 01:07:10 -08:00
Pooya Davoodi
185cc19f92
[Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch ( #12927 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-02-14 08:22:42 +00:00
Alexander Matveev
45f90bcbba
[WIP] TPU V1 Support Refactored ( #13049 )
2025-02-14 00:21:53 -08:00
Kero Liang
b0ccfc565a
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch ( #13126 )
2025-02-13 22:39:20 -08:00
Sage Moore
ba59b78a9c
[ROCm][V1] Add intial ROCm support to V1 ( #12790 )
2025-02-13 22:21:50 -08:00
Varun Sundar Rabindranath
cbc40128eb
[V1] LoRA - Enable Serving Usecase ( #12883 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-14 14:21:12 +08:00
Michael Goin
f0b2da72a8
Expand MLA to support most types of quantization ( #13181 )
2025-02-13 22:19:22 -08:00
Harry Mellor
f2b20fe491
Consolidate Llama model usage in tests ( #13094 )
2025-02-13 22:18:03 -08:00
Wang Ran (汪然)
40932d7a05
[Misc] Remove redundant statements in scheduler.py ( #13229 )
2025-02-13 22:07:25 -08:00
XiaobingZhang
84683fa271
[Bugfix] Offline example of disaggregated prefill ( #13214 )
2025-02-13 20:20:47 -08:00
Tyler Michael Smith
067678262a
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config ( #13237 )
2025-02-13 20:19:43 -08:00
Tyler Michael Smith
09545c0a94
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on ( #13250 )
2025-02-13 20:19:25 -08:00
Roger Wang
dd5ede4440
[V1] Consolidate MM cache size to vllm.envs ( #13239 )
2025-02-13 20:19:03 -08:00
Jinzhen Lin
8c32b08a86
[Kernel] Fix awq error when n is not divisable by 128 ( #13227 )
2025-02-13 20:07:05 -08:00
Gregory Shtrasberg
410886950a
[ROCm] Avoid using the default stream on ROCm ( #13238 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-14 09:29:26 +08:00
Harry Mellor
e38be640e6
Revert "Add label if pre-commit passes" ( #13242 )
2025-02-13 16:12:32 -08:00
Tyler Michael Smith
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels ( #13198 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-14 00:01:14 +00:00
Michael Goin
2344192a55
Optimize moe_align_block_size for deepseek_v3 ( #12850 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-13 18:43:37 -05:00
Harry Mellor
bffddd9a05
Add label if pre-commit passes ( #12527 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-13 20:51:30 +00:00
Nicolò Lucchesi
d84cef76eb
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint ( #12909 )
2025-02-13 07:23:45 -08:00
Vaibhav Jain
37dfa60037
[Bugfix] Missing Content Type returns 500 Internal Server Error ( #13193 )
2025-02-13 06:52:22 -08:00
Cyrus Leung
1bc3b5e71b
[VLM] Separate text-only and vision variants of the same model architecture ( #13157 )
2025-02-13 06:19:15 -08:00
燃
02ed8a1fbe
[Misc] Qwen2.5-VL Optimization ( #13155 )
2025-02-13 06:17:57 -08:00
Aoyu
2092a6fa7d
[V1][Core] Add worker_base for v1 worker ( #12816 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-13 20:35:18 +08:00
Cyrus Leung
c9d3ecf016
[VLM] Merged multi-modal processor for Molmo ( #12966 )
2025-02-13 04:34:00 -08:00
Roger Wang
fdcf64d3c6
[V1] Clarify input processing and multimodal feature caching logic ( #13211 )
2025-02-13 03:43:24 -08:00
Russell Bryant
578087e56c
[Frontend] Pass pre-created socket to uvicorn ( #13113 )
2025-02-13 00:51:46 -08:00
Isotr0py
fa253f1a70
[VLM] Remove input processor from clip and siglip ( #13165 )
2025-02-13 00:31:37 -08:00
Rui Qiao
9605c1256e
[V1][core] Implement pipeline parallel on Ray ( #12996 )
2025-02-13 08:02:46 +00:00
Russell Bryant
0ccd8769fb
[CI/Build] Allow ruff to auto-fix some issues ( #13180 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 07:45:38 +00:00
Daniel Han
cb944d5818
Allow Unsloth Dynamic 4bit BnB quants to work ( #12974 )
2025-02-12 23:13:08 -08:00
Russell Bryant
d46d490c27
[Frontend] Move CLI code into vllm.cmd package ( #12971 )
2025-02-12 23:12:21 -08:00
LikeSundayLikeRain
04f50ad9d1
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case ( #13097 )
2025-02-12 23:11:26 -08:00
Cody Yu
60c68df6d1
[Build] Automatically use the wheel of the base commit with Python-only build ( #13178 )
2025-02-12 23:10:28 -08:00
Lu Fang
009439caeb
Simplify logic of locating CUDART so file path ( #13203 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-13 13:52:41 +08:00
Isotr0py
bc55d13070
[VLM] Implement merged multimodal processor for Mllama ( #11427 )
2025-02-12 20:26:21 -08:00
Michael Goin
d88c8666a1
[Bugfix][Example] Fix GCed profiling server for TPU ( #12792 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-13 11:52:11 +08:00
Kaixi Hou
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization ( #12784 )
2025-02-12 19:51:51 -08:00
Kevin H. Luu
9f9704dca6
[perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance ( #12706 )
2025-02-12 19:51:33 -08:00
Russell Bryant
8eafe5eaea
[CI/Build] Ignore ruff warning up007 ( #13182 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 11:48:31 +08:00
Murali Andoorveedu
4c0d93f4b2
[V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort ( #13173 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2025-02-12 12:58:11 -08:00
Michael Goin
14b7899d10
[CI] Fix failing FP8 cpu offload test ( #13170 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-12 19:16:06 +00:00
Michael Goin
09972e716c
[Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity ( #13119 )
2025-02-12 09:19:53 -08:00
Qubitium-ModelCloud
36a08630e8
[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control ( #7086 )
2025-02-12 09:19:43 -08:00
Russell Bryant
2c2b560f48
[CI/Build] Use mypy matcher for pre-commit CI job ( #13162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-12 17:12:22 +00:00
Lu Fang
042c3419fa
Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path ( #12998 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-12 09:06:13 -08:00
Jee Jee Li
82cabf53a3
[Misc] Delete unused LoRA modules ( #13151 )
2025-02-12 08:58:24 -08:00
Rafael Vasquez
314cfade02
[Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral ( #12332 )
2025-02-12 08:29:56 -08:00
Cyrus Leung
985b4a2b19
[Bugfix] Fix num video tokens calculation for Qwen2-VL ( #13148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-12 11:55:23 +00:00
bnellnm
f4d97e4fc2
[Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request ( #13108 )
2025-02-12 02:39:16 -08:00
Shiyan Deng
f1042e86f0
[Misc] AMD Build Improvements ( #12923 )
2025-02-12 02:36:10 -08:00
Maximilien de Bayser
7c4033acd4
Further reduce the HTTP calls to huggingface.co ( #13107 )
2025-02-12 02:34:09 -08:00
dependabot[bot]
d59def4730
Bump actions/setup-python from 5.3.0 to 5.4.0 ( #12672 )
2025-02-12 16:41:22 +08:00
dependabot[bot]
0c7d9effce
Bump helm/chart-testing-action from 2.6.1 to 2.7.0 ( #12463 )
2025-02-12 16:41:06 +08:00
dependabot[bot]
dd3b4a01f8
Bump actions/stale from 9.0.0 to 9.1.0 ( #12462 )
2025-02-12 00:40:25 -08:00
dependabot[bot]
a0597c6b75
Bump helm/kind-action from 1.10.0 to 1.12.0 ( #11612 )
2025-02-12 00:40:19 -08:00
Lingfan Yu
e92694b6fe
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency ( #12921 )
...
Signed-off-by: Lingfan Yu <lingfany@amazon.com >
2025-02-11 21:12:37 -08:00
Kevin H. Luu
842b0fd402
[ci] Add more source file dependencies for some tests ( #13123 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-11 20:38:10 -08:00
Christian Pinto
974dfd4971
[Model] IBM/NASA Prithvi Geospatial model ( #12830 )
2025-02-11 20:34:30 -08:00
Keyun Tong
3ee696a63d
[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM ( #12518 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-02-12 12:25:58 +08:00
Russell Bryant
72c2b68dc9
[Misc] Move pre-commit suggestion back to the end ( #13114 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 22:34:16 +00:00
Yuan Tang
14ecab5be2
[Bugfix] Guided decoding falls back to outlines when fails to import xgrammar ( #12976 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-11 18:17:44 +00:00
Harry Mellor
deb6c1c6b4
[Doc] Improve OpenVINO installation doc ( #13102 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 18:02:46 +00:00
Li, Jiang
565c1efa65
[CI/Build][Bugfix] Fix CPU backend default threads num ( #13077 )
2025-02-11 16:55:56 +00:00
Szymon Ożóg
2b25b7d2e1
Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 ( #13023 )
2025-02-11 08:38:48 -08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
6c4dbe23eb
[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES ( #12962 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-02-12 00:21:50 +08:00
MoonRide303
21f5d50fa5
[Bugfix] Do not use resource module on Windows ( #12858 ) ( #13029 )
2025-02-11 08:21:18 -08:00
Jewon Lee
bf3e05215c
[Misc] Fix typo at comments at metrics.py ( #13024 )
2025-02-11 08:20:37 -08:00
Harry Mellor
ad9776353e
Set torch_dtype in TransformersModel ( #13088 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 23:51:19 +08:00
Mark McLoughlin
75e6e14516
[V1][Metrics] Add several request timing histograms ( #12644 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-11 10:14:00 -05:00
மனோஜ்குமார் பழனிச்சாமி
110f59a33e
[Bugfix] fix flaky test ( #13089 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-11 14:41:20 +00:00
wangxiyuan
2e3b969ec0
[Platform] add pre_register_and_update function ( #12432 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-02-11 22:06:46 +08:00
Yuhong Guo
da317197dd
[Build] Fix cuda link target of cumem_allocator in CPU env ( #12863 )
...
Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-11 21:55:57 +08:00
Gregory Shtrasberg
7539bbc6a6
[ROCm] Using a more precise memory profiling ( #12624 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-11 21:47:10 +08:00
Mengqing Cao
9cf4759493
[executor] init local_rank as device index ( #13027 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-02-11 21:20:53 +08:00
Cody Yu
41c5dd45b9
[V1][Metrics] Add GPU prefix cache hit rate % gauge ( #12592 )
2025-02-11 08:27:25 +00:00
Ce Gao
fc6485d277
[Bugfix]: Reasoning output bug according to the chat template change ( #13025 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-02-11 15:49:03 +08:00
Varun Sundar Rabindranath
78a141d768
[Misc] LoRA - Refactor Punica ops tests ( #12970 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-11 07:26:03 +00:00
Russell Bryant
c320ca8edd
[Core] Don't do platform detection at import time ( #12933 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 07:25:25 +00:00
Woosuk Kwon
58047c6f04
[Benchmark] Add BurstGPT to benchmark_serving ( #13063 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-02-10 21:25:30 -08:00
Florian Greinacher
cb080f32e3
[Bugfix] Support missing tool parameters in mistral tokenizer ( #12884 )
...
Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com >
2025-02-11 03:33:33 +00:00
Simon Mo
2c0f58203c
[Docs] Annouce Meta Meetup ( #13065 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-02-10 18:24:29 -08:00
Woosuk Kwon
2ff4857678
[V1][Minor] Move scheduler outputs to a separate file ( #13062 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-11 02:10:06 +00:00
Kevin H. Luu
91e876750e
[misc] Fix setup.py condition to avoid AMD from being mistaken with CPU ( #13022 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-02-10 18:06:16 -08:00
Farzad Abdolhosseini
08b2d845d6
[Model] Ultravox Model: Support v0.5 Release ( #12912 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
2025-02-10 22:02:48 +00:00
மனோஜ்குமார் பழனிச்சாமி
2ae889052c
Fix seed parameter behavior in vLLM ( #13007 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-10 23:26:50 +08:00
Cyrus Leung
51f0b5f7f6
[Bugfix] Clean up and fix multi-modal processors ( #13012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-10 10:45:21 +00:00
Kevin H. Luu
fde71262e0
[misc] Add retries with exponential backoff for HF file existence check ( #13008 )
2025-02-10 01:15:02 -08:00
Yuan Tang
243137143c
[Doc] Add link to tool_choice tracking issue in tool_calling.md ( #13003 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 06:09:33 +00:00
youkaichao
b2496bb07f
[core] fix sleep mode and pytorch checkpoint compatibility ( #13001 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 13:03:43 +08:00
Yuan Tang
44607e07d3
Check if selected backend is None in get_attn_backend_cls() ( #12975 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 11:45:07 +08:00
Nick Hill
67c4637ccf
[V1] Use msgpack for core request serialization ( #12918 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-10 11:35:56 +08:00
youkaichao
aa0ca5ebb7
[core][rlhf] add colocate example for RLHF ( #12984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 10:28:59 +08:00
youkaichao
59fff4a01a
[core] improve error handling when wake up from sleep mode ( #12981 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 09:38:57 +08:00
Lu Fang
29f1d47e73
[MISC] Always import version library first in the vllm package ( #12979 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-09 18:56:40 +08:00
youkaichao
cf797aa856
[core] port pynvml into vllm codebase ( #12963 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 15:00:00 +08:00
Woosuk Kwon
24700c346b
[V1] Cache uses_mrope in GPUModelRunner ( #12969 )
2025-02-08 15:32:32 -08:00
Patrick von Platen
d366ccc4e3
[RFC] [Mistral] FP8 format ( #10130 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-02-08 14:12:53 -07:00
Woosuk Kwon
870c37481e
[V1][Minor] Remove outdated comment ( #12968 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 12:48:30 -08:00
Jee Jee Li
86222a3dab
[VLM] Merged multi-modal processor for GLM4V ( #12449 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-08 20:32:16 +00:00
youkaichao
fe743b798d
[bugfix] fix early import of flash attention ( #12959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 00:06:56 +08:00
shangmingc
913df14da3
[Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU ( #12935 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-08 14:46:19 +00:00
Cyrus Leung
8a69e0e20e
[CI/Build] Auto-fix Markdown files ( #12941 )
2025-02-08 04:25:15 -08:00
Isotr0py
4c8dd12ef3
[Misc] Add qwen2.5-vl BNB support ( #12944 )
2025-02-08 04:24:47 -08:00
Jun Duan
256a2d29dc
[Doc] Correct HF repository for TeleChat2 models ( #12949 )
2025-02-08 01:42:15 -08:00
Liangfu Chen
c45d398e6f
[CI] Resolve transformers-neuronx version conflict ( #12925 )
2025-02-08 01:41:35 -08:00
Jun Duan
011e612d92
[Misc] Log time consumption on weight downloading ( #12926 )
2025-02-08 09:16:42 +00:00
Varun Sundar Rabindranath
7e1837676a
[misc] Add LoRA to benchmark_serving ( #12898 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-08 17:15:44 +08:00
Sanju C Sudhakaran
2880e21e3d
[Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi ( #12812 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2025-02-08 17:15:30 +08:00
wangxiyuan
407b5537db
[Build] Make pypi install work on CPU platform ( #12874 )
2025-02-08 01:15:15 -08:00
Woosuk Kwon
4ea48fb35c
[V1][Minor] Move cascade attn logic outside _prepare_inputs ( #12943 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 00:39:09 -08:00
Shaoting
e31498bdcb
[Misc] Add offline test for disaggregated prefill ( #12418 )
2025-02-08 08:38:20 +00:00
youkaichao
91dd8f7aa6
[bugfix] respect distributed_executor_backend in world_size=1 ( #12934 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-08 16:17:08 +08:00
zifeitong
d01f66b039
[Bugfix] Fix multi-round chat error when mistral tokenizer is used ( #12859 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-08 07:04:34 +00:00
Ke Zhao
cc01223f3b
[Misc] Fix typo in the example file ( #12896 )
...
Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com >
2025-02-08 06:56:43 +00:00
Jee Jee Li
306923da82
[Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping ( #12905 )
2025-02-07 21:02:53 -08:00
Woosuk Kwon
3243158336
[V1] Move KV block hashes from Request to KVCacheManager ( #12922 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:14:10 -08:00
Woosuk Kwon
b21f0f9d17
[V1][Minor] Remove outdated comment ( #12928 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:07:37 -08:00
Lu Fang
45cbc4991d
[Bugfix] Fix disagg hang caused by the prefill and decode communication issues ( #12723 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 16:39:50 -08:00
Robert Shaw
932c6b7461
[V1] LM Eval With Streaming Integration Tests ( #11590 )
2025-02-07 15:07:03 -08:00
TJian
eaa92d4437
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing ( #12501 )
2025-02-07 08:13:43 -08:00
afeldman-nm
0630d4537a
[V1] Logprobs and prompt logprobs support ( #9880 )
...
This PR is adding support for sample logprobs & prompt logprobs to vLLM v1.
New behavior:
- During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprobs values, and token ranks into the OpenAI-compatible List[Dict[token id,Logprob]] format (for sample and prompt logprobs respectively.)
- Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer.
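The rank definition above can be illustrated with a small, dependency-free sketch. This is not the vLLM implementation; the function name `logprobs_and_ranks` and the toy logits are assumptions made for illustration, and ties are resolved by giving equal logprobs the same rank.

```python
import math


def logprobs_and_ranks(logits, token_ids):
    """Return a (logprob, rank) pair for each requested token id.

    Illustrative only: rank is 1-indexed, so the most likely token has rank 1.
    """
    # Numerically stable log-softmax over the toy vocabulary.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    logprobs = [x - log_z for x in logits]

    results = []
    for tok in token_ids:
        # Rank = 1 + number of vocabulary entries with a strictly higher logprob,
        # which equals the position after sorting in descending order.
        rank = 1 + sum(1 for lp in logprobs if lp > logprobs[tok])
        results.append((logprobs[tok], rank))
    return results


if __name__ == "__main__":
    for logprob, rank in logprobs_and_ranks([2.0, 0.5, 3.0, -1.0], [2, 0, 3]):
        print(f"logprob={logprob:.4f} rank={rank}")
```

Counting strictly greater entries avoids a full sort when only a few token ids are requested.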
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-02-07 07:26:20 -08:00
Amit Garg
538fab93cd
PR #12718 ( #12718 )
2025-02-07 06:22:37 -08:00
Cyrus Leung
ce26b16268
[Misc] Remove unnecessary detokenization in multimodal processing ( #12868 )
2025-02-07 06:21:17 -08:00
Lu Fang
1918aa1b80
[MISC][EASY] Break check file names into entry and args in the pre-commit hooks ( #12880 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 13:04:39 +00:00
Maximilien de Bayser
6e1fc61f0f
Prevent unnecessary requests to huggingface hub ( #12837 )
2025-02-06 21:37:41 -08:00
Szymon Ożóg
aa375dca9f
[Bugfix] Missing quant_config in deepseek embedding layer ( #12836 )
2025-02-06 21:35:09 -08:00
ZSL98
433c4a4923
Make vllm compatible with verl ( #12824 )
...
Co-authored-by: zhangshulai <zhangshulai@bytedance.com >
2025-02-07 11:54:20 +08:00
Lucas Wilkinson
ef533d25fb
[Bugfix] FA2 illegal memory access ( #12848 )
2025-02-06 19:54:07 -08:00
Kevin H. Luu
b260782357
[misc] Revert # 12833 ( #12857 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 16:29:12 -08:00
Lu Fang
741429a4cd
[MISC] Check space in the file names in the pre commit checks ( #12804 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 15:36:21 -08:00
Yu Chin Fabian Lim
aff404571b
Add Bamba Model ( #10909 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-06 15:22:42 -08:00
Varun Sundar Rabindranath
467a96a541
[V1] LoRA Support ( #10957 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-06 09:32:51 -08:00
Isotr0py
8108ac841d
[Bugfix] Fix unsupported FA version check for Turing GPU ( #12828 )
2025-02-06 09:18:22 -08:00
Jitse Klomp
afe74f7a96
[Doc] double quote cmake package in build.inc.md ( #12840 )
2025-02-06 09:17:55 -08:00
youkaichao
09b95e36ab
[torch.compile] PyTorch 2.6 and nightly compatibility ( #12393 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-07 01:09:07 +08:00
Isotr0py
85ac82d228
[Kernel] Make rotary_embedding ops more flexible with input shape ( #12777 )
2025-02-06 08:46:13 -08:00
Cyrus Leung
1e57b1ee63
[Misc] Remove unnecessary decode call ( #12833 )
2025-02-06 08:45:44 -08:00
Kevin H. Luu
e152f29502
[misc] Reduce number of config file requests to HuggingFace ( #12797 )
...
Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 14:59:18 +00:00
Lucas Wilkinson
c786e757fa
[Attention] Use FA3 for MLA on Hopper ( #12807 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-02-06 11:43:12 +00:00
Simon Mo
cefd56ee35
[Docs] Add Google Cloud Slides ( #12814 )
2025-02-06 01:02:38 -08:00
Dipika Sikka
7ca9934fe7
[Misc] Update w2 scale loading for GPTQMarlinMoE ( #12757 )
2025-02-06 01:02:14 -08:00
youkaichao
0408efc6d0
[Misc] Improve error message for incorrect pynvml ( #12809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 15:23:50 +08:00
Michael Goin
449d1bce02
[Misc] Remove duplicated DeepSeek V2/V3 model definition ( #12793 )
2025-02-05 23:16:20 -08:00
Harry Mellor
1a6fcad4c9
Improve TransformersModel UX ( #12785 )
2025-02-05 22:24:57 -08:00
Lu Fang
56534cd577
[Bugfix] Fix the test_ultravox.py's license ( #12806 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 13:25:54 +08:00
Sumit Vij
d88506dda4
[Model] LoRA Support for Ultravox model ( #11253 )
2025-02-05 19:54:13 -08:00
Lu Fang
9cdea30b4f
[Misc][Easy] Remove the space from the file name
2025-02-05 19:23:35 -08:00
Lucas Wilkinson
76abd0c881
[Bugfix] Better FP8 supported defaults
2025-02-05 19:22:19 -08:00
Gregory Shtrasberg
5b19b93082
[ROCm][Kernel] Using the correct warp_size value
2025-02-05 19:15:08 -08:00
Cyrus Leung
75404d041b
[VLM] Update compatibility with transformers 4.49
2025-02-05 19:09:45 -08:00
Roger Wang
bf3b79efb8
[VLM] Qwen2.5-VL
2025-02-05 13:31:38 -08:00
Russell Bryant
9a5b1554b4
[Docs] Drop duplicate [source] links
2025-02-05 13:30:50 -08:00
Cyrus Leung
a4ce74c14a
[VLM] Use shared field to pass token ids to model
2025-02-05 13:30:46 -08:00
Rahul Tuli
3b2005e1db
Add: Support for Sparse24Bitmask Compressed Models
2025-02-05 13:30:43 -08:00
Sanju C Sudhakaran
af8486de49
[Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU)
2025-02-05 13:29:45 -08:00
Chen Zhang
4c3aac51e1
Merging PR #12536
...
Merged via CLI script
2025-02-05 13:24:26 -08:00
youkaichao
bc1bdecebf
[core][distributed] exact ray placement control ( #12732 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 02:03:19 +08:00
Akash kaothalkar
022bcc701a
[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 ( #12546 )
2025-02-04 23:11:02 -08:00
Michael Goin
c53dc466b1
[Doc] Remove performance warning for auto_awq.md ( #12743 )
2025-02-04 22:43:11 -08:00
Nick Hill
3d09e592a8
[V1][Misc] Shorten FinishReason enum and use constant strings ( #12760 )
2025-02-04 22:43:02 -08:00
Harry Mellor
fcf2e3d7fc
[Bugfix] Fix OpenVINO model runner ( #12750 )
2025-02-04 22:42:46 -08:00
Michael Goin
58b218d7ae
[Doc] Update PR Reminder with link to Developer Slack ( #12748 )
2025-02-04 22:42:09 -08:00
Kyle Sayers
7ff7a638b6
[Model][Quant] Fix GLM, Fix fused module mappings for quantization ( #12634 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-02-05 05:32:06 +00:00
Dipika Sikka
686006a220
[Misc] Bump the compressed-tensors version ( #12736 )
2025-02-04 20:44:48 -08:00
Isotr0py
98fd089fc9
[VLM] Add MLA with pure RoPE support for deepseek-vl2 models ( #12729 )
2025-02-04 20:44:26 -08:00
Harry Mellor
249824c3bf
Refactor Linear handling in TransformersModel ( #12727 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-05 04:31:12 +00:00
Aleksandr Malyshev
64862d106e
[ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling ( #12713 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-02-05 03:58:22 +00:00
Aviv Keshet
b3a0d01e45
[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS ( #12368 )
...
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com >
2025-02-04 18:46:26 -08:00
Lucas Wilkinson
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) ( #12676 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-02-04 18:22:24 -08:00
Mark McLoughlin
233df6f5c4
[V1][Metrics] Add request_success_total counter, labelled with finish reason ( #12579 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-04 19:46:54 -05:00
Cyrus Leung
18016a5e62
[Bugfix] Fix CI failures for InternVL and Mantis models ( #12728 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-04 23:54:23 +08:00
Sophie du Couédic
649550f27e
[Build] update requirements of no-device for plugin usage ( #12630 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
2025-02-04 21:19:12 +08:00
Kero Liang
62467a834a
Avoid unnecessary multi-modal input data copy when len(batch) == 1 ( #12722 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-02-04 21:03:19 +08:00
Michael Greenbaum
6469038b14
[Bugfix] Fix loading of fine-tuned models based on Phi-3-Small ( #12689 )
...
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com >
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com >
2025-02-04 20:58:48 +08:00
Isotr0py
815079de8e
[VLM] merged multimodal processor and V1 support for idefics3 ( #12660 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-04 20:00:51 +08:00
Woosuk Kwon
18a88fcccc
[V1] Remove scheduling constraint on partial requests ( #12674 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-04 02:43:58 -08:00
Cyrus Leung
d1ca7df84d
[VLM] Merged multi-modal processor for InternVL-based models ( #12553 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-02-04 16:44:52 +08:00
Jee Jee Li
96b23621c1
[Misc] Add BNB quantization for Whisper ( #12381 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-04 16:27:36 +08:00
Hongxia Yang
c36ac98d01
[AMD][ROCm] Enable DeepSeek model on ROCm ( #12662 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2025-02-04 08:24:11 +00:00
Kyle Sayers
4896d0c2dd
[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs ( #12711 )
2025-02-03 23:27:11 -08:00
Thomas Parnell
bb392af434
[Doc] Replace ibm-fms with ibm-ai-platform ( #12709 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-02-04 07:05:04 +00:00
Michael Goin
5d98d56089
Support Pixtral-Large HF by using llava multimodal_projector_bias config ( #12710 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-04 11:55:46 +08:00
Russell Bryant
73b35cca7f
[Core] Improve hash collision avoidance in prefix caching ( #12621 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 16:28:20 -08:00
Cody Yu
5095e96606
[V1] Revert uncache_blocks and support recaching full blocks ( #12415 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 15:04:53 -08:00
Cody Yu
cf58b9c4ca
[MISC] Remove model input dumping when exception ( #12582 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 13:34:16 -08:00
kushanam
4797dad3ec
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 ( #12707 )
2025-02-03 13:30:39 -08:00
Kyle Sayers
6dd5e52823
Squelch MLA warning for Compressed-Tensors Models ( #12704 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-02-03 13:29:56 -08:00
Tyler Michael Smith
c11de33dad
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm ( #12696 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 13:04:59 -08:00
Russell Bryant
33e0602e59
[Misc] Fix improper placement of SPDX header in scripts ( #12694 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 11:16:59 -08:00
Arthur
a1a2aaadb9
[Model]: Add transformers backend support ( #11330 )
...
# Adds support for `transformers` as a backend
Following https://github.com/huggingface/transformers/pull/35235 , a
bunch of models should already be supported, we are ramping up support
for more models.
Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the hub, if it
implements attention the correct way, can be natively supported!
- tensor parallel support
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-02-03 21:30:38 +08:00
youkaichao
1298a400e8
[ci/build] fix gh200 test ( #12681 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:59:49 +08:00
youkaichao
ad4a9dc817
[cuda] manually import the correct pynvml module ( #12679 )
...
fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:58:21 +08:00
Srikanth Srinivas
b9986454fe
Fix for attention layers to remain unquantized during moe_wn16 quant ( #12570 )
...
Fix to AWQ quant loading of the new R1 model
The new optimized MoE kernels for a large number of experts, `moe_wna16`,
use AWQ quant, which requires the attention layers to be in 16-bit.
The current merge has broken this, and `get_quant_method` must
return None for it to work correctly again.
---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Beim <beim2015@outlook.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Beim <805908499@qq.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com >
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com >
Co-authored-by: fade_away <1028552010@qq.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Shawn Du <shawnd200@outlook.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:46:19 +08:00
Eldar Kurtic
c5932e5dac
Properly check if all fused layers are in the list of targets ( #12666 )
...
Thanks @kylesayrs for catching this!
2025-02-03 13:42:18 +08:00
youkaichao
20579c0fae
make sure mistral_common not imported for non-mistral models ( #12669 )
...
When people use deepseek models, they find that they need to resolve a cv2
version conflict; see https://zhuanlan.zhihu.com/p/21064432691 .
I added the check and made all imports of `cv2` lazy.
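A minimal sketch of the lazy-import pattern described here, assuming a hypothetical `load_tokenizer_backend` helper. The standard-library `json` module stands in for the optional heavy dependency so the example runs anywhere; it is not the code from this PR.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def lazy_module(name: str):
    """Import a module on first use and cache it for later calls."""
    return importlib.import_module(name)


def load_tokenizer_backend(tokenizer_mode: str):
    """Hypothetical helper: only touch the optional dependency when needed."""
    if tokenizer_mode != "mistral":
        # Non-mistral models never trigger the import, so a missing or
        # conflicting optional package cannot break them.
        return None
    # "json" is a stand-in for an optional package such as mistral_common or cv2.
    return lazy_module("json")


if __name__ == "__main__":
    print(load_tokenizer_backend("auto"))
    print(load_tokenizer_backend("mistral").__name__)
```

Keeping the import inside the code path that needs it means users of other models do not pay for, or trip over, the optional dependency.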
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:40:25 +08:00
Yang Chen
95460fc513
[Kernel] port sgl moe_align_block_size kernels ( #12574 )
...
sgl_moe_align_block_size is based on:
ded9fcd09a
moe_align_block_size is based on:
ba5112ff69
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-03 13:09:50 +08:00
Zhuohan Li
326fcc8b9f
[Doc] Deprecate Discord ( #12668 )
2025-02-02 19:19:56 -08:00
youkaichao
e64330910b
[doc][misc] clarify VLLM_HOST_IP for multi-node inference ( #12667 )
...
As more and more people are trying deepseek models with multi-node
inference, https://github.com/vllm-project/vllm/issues/7815 becomes more
frequent. Let's give a clear message to users.
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 09:32:18 +08:00
Russell Bryant
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as
recommended to the project by the Linux Foundation. These headers
provide a concise way that is both human and machine readable for
communicating license information for each source file. It helps avoid
any ambiguity about the license of the code and can also be easily used
by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help
ensure we are in compliance with the licenses of the code we use,
including dependencies. Having these headers in place helps that tool
do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
Kunshang Ji
f256ebe4df
[Hardware][Intel GPU] add XPU bf16 support ( #12392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-02-02 10:17:26 +00:00
Shawn Du
f8ece6e17f
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager ( #12608 )
...
As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254 ,
this PR achieves the task: combine allocate_slots and append_slots.
There should be no functional change, except that decode now also
raises an exception when num_tokens is zero (like prefill); the
unit test case is changed accordingly.
@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo
---------
Signed-off-by: Shawn Du <shawnd200@outlook.com >
2025-02-02 16:40:58 +08:00
Woosuk Kwon
abfcdcdf27
[V1][Minor] Avoid frequently creating ConstantList ( #12653 )
...
A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-01 23:43:20 -08:00
Russell Bryant
e497f33491
[Core] Silence unnecessary deprecation warnings ( #12620 )
...
I noticed during testing that I was getting a lot of these deprecation
warnings about `lora_local_path`:
```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
and will be removed in a future version.
Please use 'lora_path' instead.
```
The check used for emitting this warning was always True, even when the
parameter was not actually specified. It will always be in
`__struct_fields__`. We should be checking for a non-None value,
instead.
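The difference between the two checks can be sketched as follows. `LoRARequestSketch` is a hypothetical plain class with a hand-written `__struct_fields__` tuple standing in for the real struct type; it is only meant to show why membership in the field list is always true while a non-None test is not.

```python
import warnings


class LoRARequestSketch:
    """Hypothetical stand-in for a struct with a deprecated field."""

    # Every declared field is listed here whether or not the caller set it.
    __struct_fields__ = ("lora_name", "lora_path", "lora_local_path")

    def __init__(self, lora_name, lora_path="", lora_local_path=None):
        self.lora_name = lora_name
        self.lora_path = lora_path
        self.lora_local_path = lora_local_path

        # Buggy check (always True, so the warning always fires):
        #     if "lora_local_path" in self.__struct_fields__: ...
        # Fixed check: warn only when the deprecated field was actually given.
        if lora_local_path is not None:
            warnings.warn(
                "The 'lora_local_path' attribute is deprecated and will be "
                "removed in a future version. Please use 'lora_path' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Fall back to the deprecated value if no new-style path was set.
            self.lora_path = self.lora_path or lora_local_path


if __name__ == "__main__":
    warnings.simplefilter("always")
    LoRARequestSketch("adapter", lora_path="/new/path")        # silent
    LoRARequestSketch("adapter", lora_local_path="/old/path")  # warns
```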
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 15:35:50 +08:00
Jinzhen Lin
baaa2b24da
[Bugfix] fix moe_wna16 get_quant_method ( #12648 )
...
Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method, the
GPTQ-based linear method or the AWQ-based linear method, even when the
target module is an attention layer.
baeded2569/vllm/attention/layer.py (L86-L92)
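The intended dispatch can be sketched roughly as below. The layer classes, the string return values and the `linear_quant` parameter are all hypothetical placeholders, not the vLLM types; the only point is that an attention layer falls through to `None` and therefore stays unquantized.

```python
class LinearLayer:
    """Hypothetical placeholder for a linear layer."""


class FusedMoELayer:
    """Hypothetical placeholder for a fused MoE layer."""


class AttentionLayer:
    """Hypothetical placeholder for an attention layer."""


def get_quant_method(layer, linear_quant="awq"):
    """Pick a quantization method by layer type (illustrative only)."""
    if isinstance(layer, FusedMoELayer):
        return "moe_wna16"
    if isinstance(layer, LinearLayer):
        # Linear layers use the GPTQ- or AWQ-based linear method.
        return f"{linear_quant}_linear"
    # Anything else, including attention layers, is left unquantized.
    return None


if __name__ == "__main__":
    for layer in (FusedMoELayer(), LinearLayer(), AttentionLayer()):
        print(type(layer).__name__, "->", get_quant_method(layer))
```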
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-02-02 15:29:56 +08:00
Vicente Herrera
b4e5c03306
doc: fixing minor typo in readme.md ( #12643 )
...
Word "evolved" was mistyped
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
---------
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
2025-02-01 17:17:29 +00:00
Michael Goin
3194039c0e
Apply torch.compile to fused_moe/grouped_topk ( #12637 )
2025-02-01 16:16:19 +00:00
Simon Mo
4f4d427ac2
Disable chunked prefill and/or prefix caching when MLA is enabled ( #12642 )
...
From @mgoin in https://github.com/vllm-project/vllm/pull/12638
I cannot push to that branch, therefore a new PR to unblock release.
---------
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-31 23:46:57 -08:00
Russell Bryant
1e3698393f
[CI/Build] Add label automation for structured-output, speculative-decoding, v1 ( #12280 )
...
We have `v1`, `structured-output`, and `speculative-decoding` labels on
github. This adds automation for applying these labels based on the
files touched by a PR.
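As a rough sketch of path-based labelling, assuming made-up glob patterns (the real rules live in the repository's automation config and may differ), the mapping from touched files to labels could look like this:

```python
from fnmatch import fnmatch

# Hypothetical patterns; only the three label names come from the PR text.
LABEL_RULES = {
    "v1": ["vllm/v1/*", "tests/v1/*"],
    "structured-output": ["*guided_decoding*", "*structured_output*"],
    "speculative-decoding": ["vllm/spec_decode/*", "tests/spec_decode/*"],
}


def labels_for_pr(changed_files):
    """Return the sorted labels whose patterns match any changed file."""
    labels = set()
    for label, patterns in LABEL_RULES.items():
        for path in changed_files:
            # fnmatch's "*" also matches "/" so one pattern covers subfolders.
            if any(fnmatch(path, pattern) for pattern in patterns):
                labels.add(label)
                break
    return sorted(labels)


if __name__ == "__main__":
    print(labels_for_pr(["vllm/v1/core/scheduler.py", "README.md"]))
```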
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-31 23:13:10 -08:00
Lucas Wilkinson
baeded2569
[Attention] Deepseek v3 MLA support with FP8 compute ( #12601 )
...
This PR implements Deepseek V3 support by performing matrix absorption on the fp8 weights
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
2025-01-31 21:52:51 -08:00
Rahul Tuli
3e1c76cf3a
Fix: Respect sparsity_config.ignore in Cutlass Integration ( #12517 )
...
This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.
---
### Changes
- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
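A toy version of the selection logic, under the assumption that the ignore list holds exact layer names. The function `select_scheme`, its parameters and the `"Unquantized"` fallback label are hypothetical; the two scheme names are the ones that appear in the example output of this PR.

```python
def select_scheme(layer_name, sparsity_ignore, has_fp8_quant):
    """Choose a scheme name for one layer (illustrative only)."""
    if layer_name not in sparsity_ignore:
        # The layer is covered by the Sparse24 config.
        return "CompressedTensors24"
    if has_fp8_quant:
        # Ignored by the sparsity config but still FP8-quantized.
        return "CompressedTensorsW8A8Fp8"
    # Neither sparse nor quantized ("Unquantized" is a made-up label).
    return "Unquantized"


if __name__ == "__main__":
    ignore = ["model.layers.6.self_attn.qkv_proj"]
    for name in ("model.layers.0.self_attn.qkv_proj",
                 "model.layers.6.self_attn.qkv_proj"):
        print(name, "->", select_scheme(name, ignore, has_fp8_quant=True))
```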
---
<details>
<summary>Testing Setup</summary>
The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097 ).
#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```
#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
QuantizationStrategy,
QuantizationType)
from pydantic import BaseModel
-
+from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
should_ignore_layer)
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
__all__ = ["CompressedTensorsLinearMethod"]
SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```
Apply using:
```bash
git apply logging-patch.patch
```
</details>
---
<details>
<summary>Models Tested</summary>
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`
</details>
---
<details>
<summary>Example Output</summary>
#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```
#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```
</details>
**Note:** This fix assumes that all modules in fused layers such as
`QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme.
---
For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160).
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
2025-02-01 13:41:59 +08:00
Tyler Michael Smith
cfa134d247
[Bugfix/CI] Fixup benchmark_moe.py ( #12562 )
...
Fixes `is_marlin` not being passed into `get_default_config`
Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`
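
Accepting several spellings of one option is something `argparse` supports directly. The snippet below is a minimal sketch of that pattern, not the actual `benchmark_moe.py` code; the default value and help text are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(description="option alias sketch")
# All three spellings write to the same destination. argparse derives the
# attribute name from the first long option, so the value lands in `tp_size`.
parser.add_argument("--tp-size", "-tp", "--tensor-parallel-size",
                    type=int, default=1,  # default is an assumption
                    help="tensor parallel size")

parsed = [
    parser.parse_args(argv).tp_size
    for argv in (["-tp", "4"], ["--tp-size", "4"],
                 ["--tensor-parallel-size", "4"])
]
```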
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-01 13:41:35 +08:00
Kevin H. Luu
35b7a05507
[ci] Upgrade transformers to 4.48.2 in CI dependencies ( #12599 )
2025-01-31 21:22:23 -08:00
Eldar Kurtic
1867c258bd
Fix target matching for fused layers with compressed-tensors ( #12617 )
...
Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).
Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```
To reproduce the vLLM error:
```bash
vllm serve nm-testing/eldar-test
```
With this PR
------------
Models are loaded correctly without any errors.
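
One way to picture target matching for fused layers: vLLM fuses some projections (for example `qkv_proj` and `gate_up_proj`), while a recipe lists the original, unfused module names. The sketch below is an assumption about the general idea, not the code from this PR; `FUSED_TO_SHARDS` and `matches_target` are hypothetical names.

```python
# Hypothetical mapping from a fused layer's leaf name to its unfused parts.
FUSED_TO_SHARDS = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def matches_target(layer_name: str, targets: list[str]) -> bool:
    """Return True if the layer, or every unfused part of it, is targeted."""
    if layer_name in targets:
        return True
    prefix, _, leaf = layer_name.rpartition(".")
    shards = FUSED_TO_SHARDS.get(leaf)
    if shards is None:
        return False
    # A fused layer matches only if all of its parts are listed explicitly.
    return all(f"{prefix}.{shard}" in targets for shard in shards)


targets = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.down_proj",
]
fused_ok = matches_target("model.layers.0.self_attn.qkv_proj", targets)
other_layer = matches_target("model.layers.1.self_attn.qkv_proj", targets)
```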
2025-02-01 05:07:46 +00:00
fade_away
cb3e73e4c8
[BugFix] fix wrong output when using lora and num_scheduler_steps=8 ( #11161 )
...
FIX issue https://github.com/vllm-project/vllm/issues/9688
https://github.com/vllm-project/vllm/issues/11086 #12487
---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-01 12:52:07 +08:00
Robert Shaw
b1340f9d55
[V1] Bugfix: Validate Model Input Length ( #12600 )
...
SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len
FIX #12567
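
The general idea of rejecting an over-long input up front, instead of letting it crash the engine later, can be sketched as below. This is a hypothetical helper, not the vLLM code: the function name, the exact comparison, and the error type are assumptions.

```python
def validate_prompt_length(prompt_token_ids: list[int],
                           max_model_len: int) -> None:
    """Hypothetical check: fail one request early with a clear error."""
    if len(prompt_token_ids) > max_model_len:
        raise ValueError(
            f"Prompt length {len(prompt_token_ids)} exceeds "
            f"max_model_len {max_model_len}.")


# The caller can catch the error and reject only the offending request.
try:
    validate_prompt_length(list(range(10)), max_model_len=8)
    rejected = False
except ValueError:
    rejected = True
```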
2025-01-31 18:32:04 -08:00
Brian Dellabetta
44bbca78d7
[Doc] int4 w4a16 example ( #12585 )
...
Based on a request by @mgoin , with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py )
FIX #n/a (no issue created)
@kylesayrs and I have discussed a couple additional improvements for the
quantization docs. We will revisit at a later date, possibly including:
- A section for "choosing the correct quantization scheme/ compression
technique"
- Additional vision or audio calibration datasets
---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-31 15:38:48 -08:00
Harry Mellor
60808bd4c7
[Doc] Improve installation signposting ( #12575 )
...
- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 15:38:35 -08:00
Ryan Nguyen
fc542144c4
[Feature] Fix guided decoding blocking bitmask memcpy ( #12563 )
...
**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR makes
the copy asynchronous by setting `non_blocking=True`.
(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
launches the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

With the optimization, this is no longer the case:

---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
2025-01-31 15:37:30 -08:00
Tyler Michael Smith
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 ( #12587 )
...
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-31 15:29:11 -08:00
Robert Shaw
145c2ff648
[Bugfix] Revert MoE Triton Config Default ( #12629 )
...
SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files ) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-31 15:28:47 -08:00
Kevin H. Luu
415f19474d
[release] Add input step to ask for Release version ( #12631 )
...
Adds an input step that asks for the release version, instead of having
to create a new build with the release version set as an env var.
2025-01-31 13:39:36 -08:00
Chen Zhang
89003c4082
[v1][Bugfix] Add extra_keys to block_hash for prefix caching ( #12603 )
...
This PR adds extra keys to the block hash, so that two blocks with the
same token string but different extra_keys in their parent blocks get
different hash values. For example, it generates different hash values
for the second block of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
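
Why the second block differs even though its own tokens and extra key are identical can be seen in a small chained-hash sketch. This is not vLLM's hashing code: `hash_block`, `hash_request`, the use of SHA-256, and the block size of 3 are assumptions chosen to mirror the two requests above.

```python
import hashlib
from typing import Optional, Sequence


def hash_block(parent_hash: Optional[str],
               token_ids: Sequence[int],
               extra_keys: Sequence[str] = ()) -> str:
    """Hash one block from its parent's hash, its tokens, and extra keys."""
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hash_request(token_ids: Sequence[int], block_size: int,
                 extra_keys_per_block: Sequence[Sequence[str]]) -> list[str]:
    """Chain block hashes so each block depends on all earlier blocks."""
    hashes: list[str] = []
    parent: Optional[str] = None
    for block_idx, extra in enumerate(extra_keys_per_block):
        start = block_idx * block_size
        parent = hash_block(parent, token_ids[start:start + block_size], extra)
        hashes.append(parent)
    return hashes


tokens = list(range(6))
hashes1 = hash_request(tokens, 3, [("hash1",), ("hash2",)])
hashes2 = hash_request(tokens, 3, [("hash3",), ("hash2",)])
```

Because the first block's hash already includes its extra key, the difference propagates through the parent hash into the second block.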
---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-31 13:13:04 -08:00
Cody Yu
60bcef000e
[Docs][V1] Prefix caching design ( #12598 )
...
- Create v1 design document section in docs.
- Add prefix caching design doc.
@WoosukKwon @ywang96
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:46 -08:00
Cody Yu
847f883232
[Git] Automatically sign-off commits ( #12595 )
...
It's very annoying to forget `-s` in `git commit`, because fixing the
DCO check then requires `git rebase HEAD~1 --signoff` and
`git push -f`. This PR adds a hook that signs off commits
automatically when `-s` is missing. The only change on the user side is
that users now have to install 2 hooks, so
instead of just
```
pre-commit install
```
we now need to run
```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```
Note that users who still install only the pre-commit hook won't get
any error from `git commit`; the sign-off hook simply won't run.
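
For illustration, a `commit-msg` hook that appends a missing sign-off could look roughly like the sketch below. It is not the hook added in this PR; `add_signoff` and `main` are hypothetical names, and trailer handling is simplified (the sign-off is always appended as its own final paragraph).

```python
import subprocess
from pathlib import Path


def add_signoff(message: str, name: str, email: str) -> str:
    """Append a Signed-off-by trailer unless the message already has it."""
    trailer = f"Signed-off-by: {name} <{email}>"
    if trailer in message.splitlines():
        return message
    return message.rstrip("\n") + "\n\n" + trailer + "\n"


def main(msg_path: str) -> None:
    """Rewrite the commit message file in place, using git's identity."""
    name = subprocess.check_output(
        ["git", "config", "user.name"], text=True).strip()
    email = subprocess.check_output(
        ["git", "config", "user.email"], text=True).strip()
    path = Path(msg_path)
    path.write_text(add_signoff(path.read_text(), name, email))


# Example on an in-memory message (identity values are made up):
signed = add_signoff("Fix bug\n", "A Dev", "dev@example.com")
```

Git passes the path of the commit message file as the first argument to a `commit-msg` hook, so a real hook script would call `main(sys.argv[1])`.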
cc @hmellor @youkaichao
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:33 -08:00
Robert Shaw
325f679f32
[BugFix] Fix Torch.Compile For DeepSeek ( #12594 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-31 12:06:39 -08:00
Harry Mellor
e3f7ff65e7
Add favicon to docs ( #12611 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 09:20:34 -08:00
Roger Wang
7a8987dac5
[Bugfix] Gracefully handle huggingface hub http error ( #12571 )
2025-01-31 08:19:35 +00:00
Lucas Wilkinson
cabaf4eff3
[Attention] MLA decode optimizations ( #12528 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 23:49:37 -08:00
Aleksandr Malyshev
a1fc18c030
[ROCm][AMD][Model] llama 3.2 support upstreaming ( #12421 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-01-31 12:24:28 +08:00
Lucas Wilkinson
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling ( #11868 )
2025-01-30 18:33:00 -08:00
Michael Goin
4078052f09
[V1][Log] Add max request concurrency log to V1 ( #12569 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-30 23:07:19 +00:00
Nishidha
bd2107e30a
[CPU][PPC] Updated torch, torchvision, torchaudio dependencies ( #12555 )
...
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
2025-01-30 16:29:39 -05:00
Robert Shaw
9b0c4bab36
[Kernel] Triton Configs for Fp8 Block Quantization ( #11589 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 11:53:22 -08:00
Beim
41bf5612f5
[Misc] fix typo: add missing space in lora adapter error message ( #12564 )
...
Signed-off-by: Beim <beim2015@outlook.com >
2025-01-30 15:39:22 +00:00
Harry Mellor
a2769032ca
Set ?device={device} when changing tab in installation guides ( #12560 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-30 00:05:42 -08:00
Mark McLoughlin
f17f1d4608
[V1][Metrics] Add GPU cache usage % gauge ( #12561 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 18:31:01 -08:00
Divakar Verma
1c1bb0bbf2
[Misc][MoE] add Deepseek-V3 moe tuning support ( #12558 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-30 00:47:30 +00:00
Woosuk Kwon
e0cc5f259a
[V1][BugFix] Free encoder cache for aborted requests ( #12545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-29 13:47:33 -08:00
Tyler Michael Smith
73aa6cfdf7
Revert "[Build/CI] Fix libcuda.so linkage" ( #12552 )
2025-01-29 21:12:24 +00:00
Jinzhen Lin
27b78c73ca
[Kernel] add triton fused moe kernel for gptq/awq ( #12185 )
2025-01-29 09:07:09 -05:00
Pavani Majety
b02fd288b2
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. ( #11787 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-29 01:46:12 -08:00
Yanyi Liu
ff7424f491
[Frontend] Support override generation config in args ( #12409 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-01-29 01:41:01 -08:00
Alphi
d93bf4da85
[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM ( #12069 )
...
Signed-off-by: hzh <hezhihui_thu@163.com >
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Oleg Mosalov <oleg@krai.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Signed-off-by: Chenguang Li <757486878@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Shanshan Shen <467638484@qq.com >
Signed-off-by: elijah <f1renze.142857@gmail.com >
Signed-off-by: Yikun <yikunkero@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com >
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: sixgod <evethwillbeok@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com >
Co-authored-by: Oleg Mosalov <oleg@krai.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com >
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com >
Co-authored-by: Concurrensee <yida.wu@amd.com >
Co-authored-by: Chenguang Li <757486878@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Alex Brooks <alex.brooks@ibm.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Shanshan Shen <467638484@qq.com >
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com >
Co-authored-by: Yikun Jiang <yikunkero@gmail.com >
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: TJian <tunjian1996@gmail.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-01-29 09:24:59 +00:00
Travis Johnson
036ca94c25
[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense ( #12347 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Wallas Santos <wallashss@ibm.com >
2025-01-29 08:54:35 +00:00
Maximilien de Bayser
ef001d98ef
Fix the pydantic logging validator ( #12420 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-01-29 07:53:13 +00:00
Robert Shaw
5f671cb4c3
[V1] Improve Error Message for Unsupported Config ( #12535 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-29 04:56:56 +00:00
Michael Goin
bd02164cf9
Bugfix for whisper quantization due to fake k_proj bias ( #12524 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-29 04:49:03 +00:00
Mark McLoughlin
46fb056749
[V1][Metrics] Add TTFT and TPOT histograms ( #12530 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 04:11:16 +00:00
Harry Mellor
dd6a3a02cb
[Doc] Convert docs to use colon fences ( #12471 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-29 11:38:29 +08:00
Ce Gao
a7e3eba66f
[Frontend] Support reasoning content for deepseek r1 ( #12473 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
2025-01-29 11:38:08 +08:00
Michael Goin
fbb5bd4cef
[TPU] Add example for profiling TPU inference ( #12531 )
...
Signed-off-by: mgoin <mgoin@redhat.com >
2025-01-29 03:16:47 +00:00
fenghuizhang
80fcc3ed1c
[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels ( #12482 )
...
Signed-off-by: Fenghui Zhang <fhzhang@google.com >
2025-01-28 22:36:44 +00:00
Mark McLoughlin
c386c43ca3
[V1][Metrics] Add per-request prompt/generation_tokens histograms ( #12516 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 22:07:22 +00:00
Harry Mellor
f26d790718
Do not run suggestion pre-commit hook multiple times ( #12521 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-28 20:05:27 +00:00
Michael Goin
0f657bdc52
Replace missed warning_once for rerank API ( #12472 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-28 19:06:32 +00:00
Mark McLoughlin
3fd1fb63ef
[V1][Metrics] Hook up IterationStats for Prometheus metrics ( #12478 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 16:38:38 +00:00
Jun Duan
925d2f1908
[Doc] Fix typo for x86 CPU installation ( #12514 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-01-28 16:37:10 +00:00
Cyrus Leung
8f58a51358
[VLM] Merged multi-modal processor and V1 support for Qwen-VL ( #12504 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-28 16:25:05 +00:00
Sebastian Schoennenbeck
2079e43bee
[Core] Make raw_request optional in ServingCompletion ( #12503 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-01-28 10:56:45 +00:00
Robert Shaw
e29d4358ef
[V1] Include Engine Version in Logs ( #12496 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-28 08:27:41 +00:00
Roger Wang
8cbc424975
Update README.md with V1 alpha release ( #12495 )
2025-01-28 08:22:41 +00:00
Mengqing Cao
dd66fd2b01
[CI] fix pre-commit error ( #12494 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-28 06:11:05 +00:00
Gabriel Marinho
0f465ab533
[FEATURE] Enables offline /score for embedding models ( #12021 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
2025-01-28 11:30:13 +08:00
Hossein Sarshar
23a7cbc88b
[CI/Build] Fixed the xla nightly issue report in #12451 ( #12453 )
2025-01-28 11:18:07 +08:00
Michael Goin
426a5c3625
Fix bad path in prometheus example ( #12481 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-27 18:56:31 -07:00
Liangfu Chen
ddee88d0ff
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache ( #11277 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: Jiangfei Duan <jfduan@outlook.com >
2025-01-27 17:31:16 -08:00
Harry Mellor
823ab79633
Update pre-commit hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-27 17:23:08 -07:00
Nicolò Lucchesi
6116ca8cd7
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill ( #10132 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: wallashss <wallashss@ibm.com >
Co-authored-by: wallashss <wallashss@ibm.com >
2025-01-27 13:38:35 -08:00
Bowen Wang
2bc3fbba0c
[FlashInfer] Upgrade to 0.2.0 ( #11194 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-27 18:19:24 +00:00
Woosuk Kwon
3f1fc7425a
[V1][CI/Test] Do basic test for top-p & top-k sampling ( #12469 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 09:40:04 -08:00
Mark McLoughlin
01ba927040
[V1][Metrics] Add initial Prometheus logger ( #12416 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-27 12:26:28 -05:00
Lucas Wilkinson
103bd17ac5
[Build] Only build 9.0a for scaled_mm and sparse kernels ( #12339 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-27 10:40:00 -05:00
Isotr0py
ce69f7f754
[Bugfix] Fix gpt2 GGUF inference ( #12467 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 18:31:49 +08:00
Woosuk Kwon
624a1e4711
[V1][Minor] Minor optimizations for update_from_output ( #12454 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 01:09:27 -08:00
Isotr0py
372bf0890b
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata ( #12464 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 07:25:30 +00:00
Cyrus Leung
5204ff5c3f
[Bugfix] Fix Granite 3.0 MoE model loading ( #12446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-26 21:26:44 -08:00
Pooya Davoodi
0cc6b383d7
[Frontend] Support scores endpoint in run_batch ( #12430 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-01-27 04:30:17 +00:00
Woosuk Kwon
28e0750847
[V1] Avoid list creation in input preparation ( #12457 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-26 19:57:56 -08:00
Yuan Tang
582cf78798
[DOC] Add link to vLLM blog ( #12460 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-27 03:46:19 +00:00
Kyle Mistele
0034b09ceb
[Frontend] Rerank API (Jina- and Cohere-compatible API) ( #12376 )
...
Signed-off-by: Kyle Mistele <kyle@mistele.com >
2025-01-26 19:58:45 -07:00
Tyler Michael Smith
72bac73067
[Build/CI] Fix libcuda.so linkage ( #12424 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-26 21:18:19 +00:00
Lucas Wilkinson
68f11149d8
[Bugfix][Kernel] Fix perf regression caused by PR #12405 ( #12434 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-26 11:09:34 -08:00
Tyler Michael Smith
72f4880425
[Bugfix/CI] Fix broken kernels/test_mha.py ( #12450 )
2025-01-26 10:39:03 -08:00
Tyler Michael Smith
aa2cd2c43d
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 ( #12417 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-26 19:59:58 +08:00
Matthew Hendrey
9ddc35220b
[Frontend] generation_config.json for maximum tokens( #12242 )
...
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com >
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: shangmingc <caishangming@linux.alibaba.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-26 19:59:25 +08:00
Roger Wang
a5255270c3
[Misc] Revert FA on ViT #12355 and #12435 ( #12445 )
2025-01-26 03:56:34 -08:00
Roger Wang
0ee349b553
[V1][Bugfix] Fix assertion when mm hashing is turned off ( #12439 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-26 00:47:42 -08:00
Keyun Tong
fa63e710c7
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync ( #12094 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-01-26 00:42:37 -08:00
Roger Wang
2a0309a646
[Misc][Bugfix] FA3 support to ViT MHA layer ( #12435 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-26 05:00:31 +00:00
Siyuan Liu
324960a95c
[TPU][CI] Update torchxla version in requirement-tpu.txt ( #12422 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-01-25 07:23:03 +00:00
Isotr0py
f1fc0510df
[Misc] Add FA2 support to ViT MHA layer ( #12355 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-25 15:07:35 +08:00
Divakar Verma
bf21481dde
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 ( #12408 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-25 12:17:19 +08:00
Cyrus Leung
fb30ee92ee
[Bugfix] Fix BLIP-2 processing ( #12412 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-25 11:42:42 +08:00
ElizaWszola
221d388cc5
[Bugfix][Kernel] Fix moe align block issue for mixtral ( #12413 )
2025-01-25 01:49:28 +00:00
Lucas Wilkinson
3132a933b6
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). ( #12405 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 20:20:59 +00:00
Cyrus Leung
df5dafaa5b
[Misc] Remove deprecated code ( #12383 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-24 14:45:20 -05:00
Lucas Wilkinson
ab5bbf5ae3
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build ( #12375 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 15:27:59 +00:00
Junichi Sato
3bb8e2c9a2
[Misc] Enable proxy support in benchmark script ( #12356 )
...
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp >
2025-01-24 14:58:26 +00:00
youkaichao
e784c6b998
[ci/build] sync default value for wheel size ( #12398 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:54:29 +08:00
Mohit Deopujari
9a0f3bdbe5
[Hardware][Gaudi][Doc] Add missing step in setup instructions ( #12382 )
2025-01-24 09:43:49 +00:00
youkaichao
c7c9851036
[ci/build] fix wheel size check ( #12396 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:31:25 +08:00
Roger Wang
3c818bdb42
[Misc] Use VisionArena Dataset for VLM Benchmarking ( #12389 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-24 00:22:04 -08:00
youkaichao
6dd94dbe94
[perf] fix perf regression from #12253 ( #12380 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 11:34:27 +08:00
Woosuk Kwon
0e74d797ce
[V1] Increase default batch size for H100/H200 ( #12369 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-24 03:19:55 +00:00
Dipika Sikka
55ef66edf4
Update compressed-tensors version ( #12367 )
2025-01-24 11:19:42 +08:00
omer-dayan
5e5630a478
[Bugfix] Path join when building local path for S3 clone ( #12353 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
2025-01-24 11:06:07 +08:00
Russell Bryant
d3d6bb13fb
Set weights_only=True when using torch.load() ( #12366 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-24 02:17:30 +00:00
Nick Hill
24b0205f58
[V1][Frontend] Coalesce bunched RequestOutputs ( #12298 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2025-01-23 17:17:41 -08:00
Russell Bryant
c5cffcd0cd
[Docs] Update spec decode + structured output in compat matrix ( #12373 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-24 01:15:52 +00:00
Woosuk Kwon
682b55bc07
[Docs] Add meetup slides ( #12345 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-23 14:10:03 -08:00
Junichi Sato
9726ad676d
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script ( #12357 )
...
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp >
2025-01-23 17:02:13 -05:00
Dipika Sikka
eb5cb5e528
[BugFix] Fix parameter names and process_after_weight_loading for W4A16 MoE Group Act Order ( #11528 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-23 21:40:33 +00:00
Isotr0py
2cbeedad09
[Docs] Document Phi-4 support ( #12362 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 19:18:51 +00:00
Siyuan Liu
2c85529bfc
[TPU] Update TPU CI to use torchxla nightly on 20250122 ( #12334 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-01-23 18:50:16 +00:00
Gregory Shtrasberg
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation ( #11906 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2025-01-23 18:04:03 +00:00
youkaichao
6e650f56a1
[torch.compile] decouple compile sizes and cudagraph sizes ( #12243 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 02:01:30 +08:00
youkaichao
3f50c148fd
[core] add wake_up doc and some sanity check ( #12361 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 02:00:50 +08:00
Isotr0py
8c01b8022c
[Bugfix] Fix broken internvl2 inference with v1 ( #12360 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 17:20:33 +00:00
Roger Wang
99d01a5e3d
[V1] Simplify M-RoPE ( #12352 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: imkero <kerorek@outlook.com >
2025-01-23 23:13:23 +08:00
Cyrus Leung
d07efb31c5
[Doc] Troubleshooting errors during model inspection ( #12351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-23 22:46:58 +08:00
Lucas Wilkinson
978b45f399
[Kernel] Flash Attention 3 Support ( #12093 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-23 06:45:48 -08:00
Isotr0py
c5b4b11d7f
[Bugfix] Fix k_proj's bias for whisper self attention ( #12342 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 10:15:33 +00:00
liuzhenwei
8ae5ff2009
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update ( #12338 )
...
Signed-off-by: zhenwei <zhenweiliu@habana.ai >
2025-01-23 08:35:46 +00:00
youkaichao
511627445e
[doc] explain common errors around torch.compile ( #12340 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-23 14:56:02 +08:00
Cody Yu
f0ef37233e
[V1] Add uncache_blocks ( #12333 )
2025-01-23 04:19:21 +00:00
Russell Bryant
7551a34032
[Docs] Document vulnerability disclosure process ( #12326 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-23 03:44:09 +00:00
Michael Goin
01a55941f5
[Docs] Update FP8 KV Cache documentation ( #12238 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-23 11:18:09 +08:00
Alexei-V-Ivanov-AMD
8d7aa9de71
[Bugfix] Fixing AMD LoRA CI test. ( #12329 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-01-23 10:53:02 +08:00
rasmith
68c4421b6d
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD ( #12282 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-01-23 00:10:37 +00:00
Nick Hill
aea94362c9
[Frontend][V1] Online serving performance improvements ( #12287 )
2025-01-22 22:22:12 +00:00
Cody Yu
7206ce4ce1
[Core] Support reset_prefix_cache ( #12284 )
2025-01-22 18:52:27 +00:00
Konrad Zawora
96f6a7596f
[Bugfix] Fix HPU multiprocessing executor ( #12167 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-23 02:07:07 +08:00
Jee Jee Li
84bee4bd5c
[Misc] Improve the readability of BNB error messages ( #12320 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-22 16:56:54 +00:00
Robin
fc66dee76d
[Misc] Fix the error in the tip for the --lora-modules parameter ( #12319 )
...
Signed-off-by: wangerxiao <863579016@qq.com >
2025-01-22 16:48:41 +00:00
Cyrus Leung
6609cdf019
[Doc] Add docs for prompt replacement ( #12318 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 14:56:29 +00:00
Roger Wang
16366ee8bb
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 ( #12313 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-22 21:06:36 +08:00
zhou fan
528dbcac7d
[Model][Bugfix]: correct Aria model output ( #12309 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2025-01-22 11:39:19 +00:00
Cyrus Leung
cd7b6f0857
[VLM] Avoid unnecessary tokenization ( #12310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 11:08:31 +00:00
youkaichao
68ad4e3a8d
[Core] Support fully transparent sleep mode ( #11743 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:39:32 +08:00
Mengqing Cao
4004f144f3
[Build] update requirements of no-device ( #12299 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-22 14:29:31 +08:00
youkaichao
66818e5b63
[core] separate builder init and builder prepare for each batch ( #12253 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:13:52 +08:00
Nick Hill
222a9dc350
[Benchmark] More accurate TPOT calc in benchmark_serving.py ( #12288 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-01-22 13:46:14 +08:00
Cyrus Leung
cbdc4ad5a5
[Ci/Build] Fix mypy errors on main ( #12296 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 12:06:54 +08:00
Liangfu Chen
016e3676e7
[CI] add docker volume prune to neuron CI ( #12291 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-01-22 10:47:49 +08:00
Kevin H. Luu
64ea24d0b3
[ci/lint] Add back default arg for pre-commit ( #12279 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-01-22 01:15:27 +00:00
Cyrus Leung
df76e5af26
[VLM] Simplify post-processing of replacement info ( #12269 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 16:48:13 -08:00
Hongxia Yang
09ccc9c8f7
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose ( #12281 )
...
Signed-off-by: Hongxia Yang <hongxyan@amd.com >
2025-01-22 07:49:22 +08:00
Aleksandr Malyshev
69196a9bc7
[BUGFIX] When skip_tokenize_init and multistep are set, execution crashes ( #12277 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: maleksan85 <maleksan@amd.com >
2025-01-21 23:30:46 +00:00
Divakar Verma
2acba47d9b
[bugfix] moe tuning. rm is_navi() ( #12273 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-21 22:47:32 +00:00
Jani Monoses
9c485d9e25
[Core] Free CPU pinned memory on environment cleanup ( #10477 )
2025-01-21 11:56:41 -08:00
wangxiyuan
fa9ee08121
[Misc] Set default backend to SDPA for get_vit_attn_backend ( #12235 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-21 11:52:11 -08:00
Adrian Cole
347eeebe3b
[Misc] Remove experimental dep from tracing.py ( #12007 )
...
Signed-off-by: Adrian Cole <adrian.cole@elastic.co >
2025-01-21 11:51:55 -08:00
Andy Lo
18fd4a8331
[Bugfix] Multi-sequence broken ( #11898 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-01-21 11:51:35 -08:00
Ricky Xu
132a132100
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types ( #10907 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2025-01-21 11:51:13 -08:00
Jinzhen Lin
1e60f87bb3
[Kernel] fix moe_align_block_size error condition ( #12239 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-01-21 10:30:28 -08:00
Jannis Schönleber
9705b90bcf
[Bugfix] fix race condition that leads to wrong order of token returned ( #10802 )
...
Signed-off-by: Jannis Schönleber <joennlae@gmail.com >
2025-01-21 09:47:04 -08:00
youkaichao
3aec49e56f
[ci/build] update nightly torch for gh200 test ( #12270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 23:03:17 +08:00
Mengqing Cao
c64612802b
[Platform] improve platforms getattr ( #12264 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-21 14:42:41 +00:00
Thomas Parnell
9a7c3a0042
Remove pytorch comments for outlines + compressed-tensors ( #12260 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-01-21 21:49:08 +08:00
Roger Wang
b197a5ccfd
[V1][Bugfix] Fix data item ordering in mixed-modality inference ( #12259 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-21 13:18:43 +00:00
youkaichao
c81081fece
[torch.compile] transparent compilation with more logging ( #12246 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 19:32:55 +08:00
Cyrus Leung
a94eee4456
[Bugfix] Fix mm_limits access for merged multi-modal processor ( #12252 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 10:09:39 +00:00
Cyrus Leung
f2e9f2a3be
[Misc] Remove redundant TypeVar from base model ( #12248 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 08:40:39 +00:00
Jee Jee Li
1f1542afa9
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration ( #12237 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-21 07:49:08 +00:00
Cyrus Leung
96912550c8
[Misc] Rename MultiModalInputsV2 -> MultiModalInputs ( #12244 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 07:31:19 +00:00
youkaichao
2fc6944c5e
[ci/build] disable failed and flaky tests ( #12240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 13:25:03 +08:00
Nicolò Lucchesi
5fe6bf29d6
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 ( #12230 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-01-21 12:23:14 +08:00
Gregory Shtrasberg
d4b62d4641
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork ( #11777 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-01-21 12:22:23 +08:00
Michael Goin
ecf67814f1
Add quantization and guided decoding CODEOWNERS ( #12228 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-20 18:23:40 -07:00
Jinzhen Lin
750f4cabfa
[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) ( #12222 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-20 16:42:16 -08:00
Cheng Kuan Yong Jason
06a760d6e8
[bugfix] catch xgrammar unsupported array constraints ( #12210 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
2025-01-20 16:42:02 -08:00
youkaichao
da7512215f
[misc] add cuda runtime version to usage data ( #12190 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-01-21 00:31:01 +00:00
Işık
af69a6aded
fix: update platform detection for M-series arm based MacBook processors ( #12227 )
...
Signed-off-by: isikhi <huseyin.isik000@gmail.com >
2025-01-20 22:23:28 +00:00
Roger Wang
7bd3630067
[Misc] Update CODEOWNERS ( #12229 )
2025-01-20 22:19:09 +00:00
Chen Zhang
96663699b2
[CI] Pass local python version explicitly to pre-commit mypy.sh ( #12224 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-20 23:49:18 +08:00
Cyrus Leung
18572e3384
[Bugfix] Fix HfExampleModels.find_hf_info ( #12223 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 15:35:36 +00:00
wangxiyuan
86bfb6dba7
[Misc] Pass attention to impl backend ( #12218 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-20 23:25:28 +08:00
Chen Zhang
5f0ec3935a
[V1] Remove _get_cache_block_size ( #12214 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-20 21:54:16 +08:00
youkaichao
c222f47992
[core][bugfix] configure env var during import vllm ( #12209 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 19:35:59 +08:00
youkaichao
170eb35079
[misc] print a message to suggest how to bypass commit hooks ( #12217 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 18:06:24 +08:00
Cyrus Leung
b37d82791e
[Model] Upgrade Aria to transformers 4.48 ( #12203 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 17:58:48 +08:00
Cyrus Leung
3127e975fb
[CI/Build] Make pre-commit faster ( #12212 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 17:36:24 +08:00
Cyrus Leung
4001ea1266
[CI/Build] Remove dummy CI steps ( #12208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 16:41:57 +08:00
youkaichao
5c89a29c22
[misc] add placeholder format.sh ( #12206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 16:04:49 +08:00
Cyrus Leung
59a0192fb9
[Core] Interface for accessing model from VllmRunner ( #10353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 15:00:59 +08:00
Isotr0py
83609791d2
[Model] Add Qwen2 PRM model support ( #12202 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-20 14:59:46 +08:00
Yuan Tang
0974c9bc5c
[Bugfix] Fix incorrect types in LayerwiseProfileResults ( #12196 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:59:20 +08:00
Yuan Tang
d2643128f7
[DOC] Add missing docstring in LLMEngine.add_request() ( #12195 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:59:00 +08:00
Yuan Tang
c5c06209ec
[DOC] Fix typo in docstring and assert message ( #12194 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:58:29 +08:00
Harry Mellor
3ea7b94523
Move linting to pre-commit ( #11975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-20 14:58:01 +08:00
youkaichao
51ef828f10
[torch.compile] fix sym_tensor_indices ( #12191 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 11:37:50 +08:00
shangmingc
df450aa567
[Bugfix] Fix num_heads value for simple connector when tp enabled ( #12074 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-01-20 02:56:43 +00:00
Martin Gleize
bbe5f9de7d
[Model] Support for fairseq2 Llama ( #11442 )
...
Signed-off-by: Martin Gleize <mgleize@meta.com >
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas >
2025-01-19 10:40:40 -08:00
Roger Wang
81763c58a0
[V1] Add V1 support of Qwen2-VL ( #12128 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: imkero <kerorek@outlook.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-19 19:52:13 +08:00
Isotr0py
edaae198e7
[Misc] Add BNB support to GLM4-V model ( #12184 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-19 19:49:22 +08:00
gujing
936db119ed
benchmark_serving support --served-model-name param ( #12109 )
...
Signed-off-by: zibai <zibai.gj@alibaba-inc.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-01-19 09:59:56 +00:00
youkaichao
e66faf4809
[torch.compile] store inductor compiled Python file ( #12182 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-19 16:27:26 +08:00
Cyrus Leung
630eb5b5ce
[Bugfix] Fix multi-modal processors for transformers 4.48 ( #12187 )
2025-01-18 19:16:34 -08:00
Michal Adamczyk
4e94951bb1
[BUGFIX] Move scores to float32 in case of running xgrammar on cpu ( #12152 )
...
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai >
2025-01-19 11:12:05 +08:00
Simon Mo
7a8a48d51e
[V1] Collect env var for usage stats ( #12115 )
2025-01-19 03:07:15 +00:00
yancong
32eb0da808
[Misc] Support register quantization method out-of-tree ( #11969 )
2025-01-18 16:13:16 -08:00
youkaichao
6d0e3d3724
[core] clean up executor class hierarchy between v1 and v0 ( #12171 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 14:35:15 +08:00
Isotr0py
02798ecabe
[Model] Port deepseek-vl2 processor, remove dependency ( #12169 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-18 13:59:39 +08:00
Russell Bryant
813f249f02
[Docs] Fix broken link in SECURITY.md ( #12175 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-18 04:35:21 +00:00
youkaichao
da02cb4b27
[core] further polish memory profiling ( #12126 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 12:25:08 +08:00
Hongxia Yang
c09503ddd6
[AMD][CI/Build][Bugfix] use pytorch stale wheel ( #12172 )
...
Signed-off-by: hongxyan <hongxyan@amd.com >
2025-01-18 11:15:53 +08:00
youkaichao
2b83503227
[misc] fix cross-node TP ( #12166 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 10:53:27 +08:00
youkaichao
7b98a65ae6
[torch.compile] disable logging when cache is disabled ( #12043 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-17 20:29:31 +00:00
Gregory Shtrasberg
b5b57e301e
[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant ( #12134 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-01-17 17:12:26 +00:00
Kunshang Ji
54cacf008f
[Bugfix] Mistral tokenizer encode accept list of str ( #12149 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-17 16:47:53 +00:00
Wallas Henrique
58fd57ff1d
[Bugfix] Fix score api for missing max_model_len validation ( #12119 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2025-01-17 16:24:22 +00:00
youkaichao
87a0c076af
[core] allow callable in collective_rpc ( #12151 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-17 20:47:01 +08:00
Li, Jiang
d4e6194570
[CI/Build][CPU][Bugfix] Fix CPU CI ( #12150 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-17 19:39:52 +08:00
Jee Jee Li
07934cc237
[Misc][LoRA] Improve the readability of LoRA error messages ( #12102 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-17 19:32:28 +08:00
Chen Zhang
69d765f5a5
[V1] Move more control of kv cache initialization from model_executor to EngineCore ( #11960 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-17 07:39:35 +00:00
Divakar Verma
8027a72461
[ROCm][MoE] moe tuning support for rocm ( #12049 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-17 14:49:16 +08:00
Isotr0py
d75ab55f10
[Misc] Add deepseek_vl2 chat template ( #12143 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-17 06:34:48 +00:00
Chen Zhang
d1adb9b403
[BugFix] add more is not None check in VllmConfig.__post_init__ ( #12138 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-17 05:33:22 +00:00
Yuan Tang
b8bfa46a18
[Bugfix] Fix issues in CPU build Dockerfile ( #12135 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-17 12:54:01 +08:00
Yuan Tang
1475847a14
[Doc] Add instructions on using Podman when SELinux is active ( #12136 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-17 04:45:36 +00:00
Kunshang Ji
fead53ba78
[CI]add genai-perf benchmark in nightly benchmark ( #10704 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-17 04:15:09 +00:00
Kuntai Du
ebc73f2828
[Bugfix] Fix a path bug in disaggregated prefill example script. ( #12121 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-17 11:12:41 +08:00
Chen Zhang
d06e824006
[Bugfix] Set enforce_eager automatically for mllama ( #12127 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-16 15:30:08 -05:00
Isotr0py
62b06ba23d
[Model] Add support for deepseek-vl2-tiny model ( #12068 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-16 17:14:48 +00:00
Varun Sundar Rabindranath
5fd24ec02e
[misc] Add LoRA kernel micro benchmarks ( #11579 )
2025-01-16 15:51:40 +00:00
Roger Wang
874f7c292a
[Bugfix] Fix max image feature size for Llava-one-vision ( #12104 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-16 14:54:06 +00:00
youkaichao
92e793d91a
[core] LLM.collective_rpc interface and RLHF example ( #12084 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-16 20:19:52 +08:00
youkaichao
bf53e0c70b
Support torchrun and SPMD-style offline inference ( #12071 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-16 19:58:53 +08:00
Isotr0py
dd7c9ad870
[Bugfix] Remove hardcoded head_size=256 for Deepseek v2 and v3 ( #12067 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-16 10:11:54 +00:00
Michael Goin
9aa1519f08
Various cosmetic/comment fixes ( #12089 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-16 09:59:06 +00:00
Cyrus Leung
f8ef146f03
[Doc] Add documentation for specifying model architecture ( #12105 )
2025-01-16 15:53:43 +08:00
Elfie Guo
fa0050db08
[Core] Default to using per_token quantization for fp8 when cutlass is supported. ( #8651 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-16 04:31:27 +00:00
tvirolai-amd
cd9d06fb8d
Allow hip sources to be directly included when compiling for rocm. ( #12087 )
2025-01-15 16:46:03 -05:00
Varun Sundar Rabindranath
ebd8c669ef
[Bugfix] Fix _get_lora_device for HQQ marlin ( #12090 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-01-15 19:59:42 +00:00
Roger Wang
70755e819e
[V1][Core] Autotune encoder cache budget ( #11895 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-15 11:29:00 -08:00
Joe Runde
edce722eaa
[Bugfix] use right truncation for non-generative tasks ( #12050 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-01-16 00:31:01 +08:00
maang-h
57e729e874
[Doc]: Update OpenAI-Compatible Server documents ( #12082 )
2025-01-15 16:07:45 +00:00
kewang-xlnx
de0526f668
[Misc][Quark] Upstream Quark format to VLLM ( #10765 )
...
Signed-off-by: kewang-xlnx <kewang@xilinx.com >
Signed-off-by: kewang2 <kewang2@amd.com >
Co-authored-by: kewang2 <kewang2@amd.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-15 11:05:15 -05:00
Yuan
5ecf3e0aaf
Misc: allow to use proxy in HTTPConnection ( #12042 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2025-01-15 13:16:40 +00:00
RunningLeon
97eb97b5a4
[Model]: Support internlm3 ( #12037 )
2025-01-15 11:35:17 +00:00
wangxiyuan
3adf0ffda8
[Platform] Do not raise error if _Backend is not found ( #12023 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
2025-01-15 10:14:15 +00:00
Keyun Tong
ad388d25a8
Type-fix: make execute_model output type optional ( #12020 )
2025-01-15 09:44:56 +00:00
Rahul Tuli
cbe94391eb
Fix: cases with empty sparsity config ( #12057 )
...
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
2025-01-15 17:41:24 +08:00
Chen Zhang
994fc655b7
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager ( #12003 )
2025-01-15 07:55:30 +00:00
Kyle Sayers
3f9b7ab9f5
[Doc] Update examples to remove SparseAutoModelForCausalLM ( #12062 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-01-15 06:36:01 +00:00
youkaichao
ad34c0df0f
[core] platform agnostic executor via collective_rpc ( #11256 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-15 13:45:21 +08:00
Rui Qiao
f218f9c24d
[core] Turn off GPU communication overlap for Ray executor ( #12051 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-01-15 05:19:55 +00:00
Elfie Guo
0794e7446e
[Misc] Add multipstep chunked-prefill support for FlashInfer ( #10467 )
2025-01-15 12:47:49 +08:00
Woosuk Kwon
b7ee940a82
[V1][BugFix] Fix edge case in VLM scheduling ( #12065 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-14 20:21:28 -08:00
Shanshan Shen
9ddac56311
[Platform] move current_memory_usage() into platform ( #11369 )
...
Signed-off-by: Shanshan Shen <467638484@qq.com >
2025-01-15 03:38:25 +00:00
Konrad Zawora
1a51b9f872
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py ( #12046 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-15 02:59:18 +00:00
Jee Jee Li
42f5e7c52a
[Kernel] Support MulAndSilu ( #11624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 02:29:53 +00:00
Jee Jee Li
a3a3ee4e6f
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping ( #11924 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 07:49:49 +08:00
maang-h
87054a57ab
[Doc]: Update the Json Example of the Engine Arguments document ( #12045 )
2025-01-14 17:03:04 +00:00
Harry Mellor
c9d6ff530b
Explain where the engine args go when using Docker ( #12041 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-14 16:05:50 +00:00
Chen Zhang
a2d2acb4c8
[Bugfix][Kernel] Give unique name to BlockSparseFlashAttention ( #12040 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-14 15:45:05 +00:00
wangxiyuan
2e0e017610
[Platform] Add output for Attention Backend ( #11981 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-14 13:27:04 +00:00
Chen Zhang
1f18adb245
[Kernel] Revert the API change of Attention.forward ( #12038 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-14 20:59:32 +08:00
Cyrus Leung
bb354e6b2d
[Bugfix] Fix various bugs in multi-modal processor ( #12031 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-14 12:16:11 +00:00
youkaichao
ff39141a49
[HPU][misc] add comments for explanation ( #12034 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-14 19:24:06 +08:00
TJian
8a1f938e6f
[Doc] Update Quantization Hardware Support Documentation ( #12025 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-01-14 04:37:52 +00:00
Konrad Zawora
078da31903
[HPU][Bugfix] set_forward_context and CI test execution ( #12014 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-14 11:04:18 +08:00
Woosuk Kwon
1a401252b5
[Docs] Add Sky Computing Lab to project intro ( #12019 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-13 17:24:36 -08:00
Steve Luo
f35ec461fc
[Bugfix] Fix deepseekv3 gate bias error ( #12002 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-13 13:43:51 -07:00
Yikun Jiang
289b5191d5
[Doc] Fix build from source and installation link in README.md ( #12013 )
...
Signed-off-by: Yikun <yikunkero@gmail.com >
2025-01-13 17:23:59 +00:00
elijah
c6db21313c
bugfix: Fix signature mismatch in benchmark's get_tokenizer function ( #11982 )
...
Signed-off-by: elijah <f1renze.142857@gmail.com >
2025-01-13 15:22:07 +00:00
Shanshan Shen
a7d59688fb
[Platform] Move get_punica_wrapper() function to Platform ( #11516 )
...
Signed-off-by: Shanshan Shen <467638484@qq.com >
2025-01-13 13:12:10 +00:00
youkaichao
458e63a2c6
[platform] add device_control env var ( #12009 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-13 20:59:09 +08:00
Harry Mellor
e8c23ff989
[Doc] Organise installation documentation into categories and tabs ( #11935 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-13 12:27:36 +00:00
Roger Wang
cd8249903f
[Doc][V1] Update model implementation guide for V1 support ( #11998 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-01-13 11:58:54 +00:00
Chen Zhang
0f8cafe2d1
[Kernel] unified_attention for Attention.forward ( #11967 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-13 19:28:53 +08:00
Alex Brooks
5340a30d01
Fix Max Token ID for Qwen-VL-Chat ( #11980 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-01-13 08:37:48 +00:00
youkaichao
89ce62a316
[platform] add ray_device_key ( #11948 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-13 16:20:52 +08:00
Chenguang Li
c3f05b09a0
[Misc]Minor Changes about Worker ( #11555 )
...
Signed-off-by: Chenguang Li <757486878@qq.com >
2025-01-13 15:47:05 +08:00
Concurrensee
cf6bbcb493
[Misc] Fix Deepseek V2 fp8 kv-scale remapping ( #11947 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-01-12 23:05:06 -08:00
Sungjae Lee
80ea3af1a0
[CI][Spec Decode] fix: broken test for EAGLE model ( #11972 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
2025-01-13 06:50:35 +00:00
Siyuan Li
9dd02d85ca
[Bug] Fix usage of .transpose() and .view() consecutively. ( #11979 )
2025-01-13 06:24:10 +00:00
Yangcheng Li
f7b3ba82c3
[MISC] fix typo in kv transfer send recv test ( #11983 )
2025-01-13 05:07:48 +00:00
Robert Shaw
619ae268c3
[V1] [2/n] Logging and Metrics - OutputProcessor Abstraction ( #11973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-13 04:54:10 +00:00
Isotr0py
d14e98d924
[Model] Support GGUF models newly added in transformers 4.46.0 ( #9685 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-13 00:13:44 +00:00
Robert Shaw
9597a095f2
[V1][Core][1/n] Logging and Metrics ( #11962 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-12 21:02:02 +00:00
Avshalom Manevich
263a870ee1
[Hardware][TPU] workaround fix for MoE on TPU ( #11764 )
2025-01-12 10:53:51 -05:00
Akshat Tripathi
8bddb73512
[Hardware][CPU] Multi-LoRA implementation for the CPU backend ( #11100 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Oleg Mosalov <oleg@krai.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Oleg Mosalov <oleg@krai.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-12 13:01:52 +00:00
Isotr0py
f967e51f38
[Model] Initialize support for Deepseek-VL2 models ( #11578 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-12 00:17:24 -08:00
Rafael Vasquez
43f3d9e699
[CI/Build] Add markdown linter ( #11857 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2025-01-12 00:17:13 -08:00
Roger Wang
b25cfab9a0
[V1] Avoid sending text prompt to core engine ( #11963 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-12 06:36:38 +00:00
sixgod
4b657d3292
[Model] Add cogagent model support vLLM ( #11742 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-11 19:05:56 +00:00
Nicolò Lucchesi
d697dc01b4
[Bugfix] Fix RobertaModel loading ( #11940 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-01-11 14:05:09 +00:00
Cyrus Leung
a991f7d508
[Doc] Basic guide for writing unit tests for new models ( #11951 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-11 21:27:24 +08:00
Cyrus Leung
7a3a83e3b8
[CI/Build] Move model-specific multi-modal processing tests ( #11934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-11 13:50:05 +08:00
shaochangxu
c32a7c7c0c
[Bugfix] fused_experts_impl wrong compute type for float32 ( #11921 )
...
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
2025-01-11 13:49:39 +08:00
Sungjae Lee
2118d0565c
[Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design ( #11672 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
2025-01-10 20:49:38 -08:00
youkaichao
899136b857
[ci] fix broken distributed-tests-4-gpus ( #11937 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-11 09:07:24 +08:00
Fred Reiss
c9f09a4fe8
[mypy] Fix mypy warnings in api_server.py ( #11941 )
...
Signed-off-by: Fred Reiss <frreiss@us.ibm.com >
2025-01-11 01:04:58 +00:00
Travis Johnson
d45cbe70f5
[Bugfix] Check that number of images matches number of <|image|> tokens with mllama ( #11939 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-01-10 23:26:00 +00:00
minmin
8a579408f3
[Misc] Update benchmark_prefix_caching.py fixed example usage ( #11920 )
...
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn >
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn >
2025-01-10 20:39:22 +00:00
Isotr0py
46fa98ccad
[Misc] Clean up debug code in Deepseek-V3 ( #11930 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-10 19:19:15 +00:00
Li, Jiang
aa1e77a19c
[Hardware][CPU] Support MOE models on x86 CPU ( #11831 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-10 11:07:58 -05:00
Kuntai Du
5959564f94
Doc fix in benchmark_long_document_qa_throughput.py ( #11933 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-10 23:51:43 +08:00
Kuntai Du
f33e033e27
[Docs] Fix docstring in get_ip function ( #11932 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-10 23:51:02 +08:00
Harry Mellor
482cdc494e
[Doc] Rename offline inference examples ( #11927 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-10 23:50:29 +08:00
wangxiyuan
20410b2fda
[platform] support custom torch.compile backend key ( #11318 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-10 23:46:51 +08:00
Cyrus Leung
12664ddda5
[Doc] [1/N] Initial guide for merged multi-modal processor ( #11925 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 14:30:25 +00:00
youkaichao
241ad7b301
[ci] Fix sampler tests ( #11922 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-10 20:45:33 +08:00
Harry Mellor
d85c47d6ad
Replace "online inference" with "online serving" ( #11923 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-10 12:05:56 +00:00
wangxiyuan
ef725feafc
[platform] support pytorch custom op pluggable ( #11328 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-10 10:02:38 +00:00
cennn
d907be7dc7
[misc] remove python function call for custom activation op ( #11885 )
...
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-10 17:18:25 +08:00
youkaichao
d53575a5f0
[ci] fix gh200 tests ( #11919 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-10 16:25:17 +08:00
Kunshang Ji
61af633256
[BUGFIX] Fix UnspecifiedPlatform package name ( #11916 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-10 16:20:46 +08:00
Joe Runde
ac2f3f7fee
[Bugfix] Validate lora adapters to avoid crashing server ( #11727 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-10 15:56:36 +08:00
Chen Zhang
cf5f000d21
[torch.compile] Hide KV cache behind torch.compile boundary ( #11677 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-10 13:14:42 +08:00
Cyrus Leung
3de2b1eafb
[Doc] Show default pooling method in a table ( #11904 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 11:25:20 +08:00
Cyrus Leung
b844b99ad3
[VLM] Enable tokenized inputs for merged multi-modal processor ( #11900 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 03:24:00 +00:00
Cyrus Leung
c3cf54dda4
[Doc][5/N] Move Community and API Reference to the bottom ( #11896 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-01-10 03:10:12 +00:00
Charles Frye
36f5303578
[Docs] Add Modal to deployment frameworks ( #11907 )
2025-01-09 23:26:37 +00:00
Cyrus Leung
9a228348d2
[Misc] Provide correct Pixtral-HF chat template ( #11891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 10:19:37 -07:00
youkaichao
bd82872211
[ci]try to fix flaky multi-step tests ( #11894 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-09 14:47:29 +00:00
wangxiyuan
405eb8e396
[platform] Allow platform specify attention backend ( #11609 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
2025-01-09 21:46:50 +08:00
Cyrus Leung
65097ca0af
[Doc] Add model development API Reference ( #11884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 09:43:40 +00:00
Ye (Charlotte) Qi
1d967acb45
[Bugfix] fix beam search input errors and latency benchmark script ( #11875 )
...
Signed-off-by: Ye Qi <yeq@meta.com >
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com >
2025-01-09 17:36:39 +08:00
Cyrus Leung
0bd1ff4346
[Bugfix] Override dunder methods of placeholder modules ( #11882 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 09:02:53 +00:00
youkaichao
310aca88c9
[perf]fix current stream ( #11870 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-09 07:18:21 +00:00
Guspan Tanadi
a732900efc
[Doc] Intended links Python multiprocessing library ( #11878 )
2025-01-09 05:39:39 +00:00
Cyrus Leung
d848800e88
[Misc] Move print_*_once from utils to logger ( #11298 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
2025-01-09 12:48:12 +08:00
Michael Goin
730e9592e9
[Doc] Recommend uv and python 3.12 for quickstart guide ( #11849 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-09 11:37:48 +08:00
Maximilien de Bayser
1fe554bac3
treat do_lower_case in the same way as the sentence-transformers library ( #11815 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-01-09 11:05:43 +08:00
Tyler Michael Smith
615e4a5401
[CI] Turn on basic correctness tests for V1 ( #10864 )
2025-01-08 21:20:44 -05:00
Simon Mo
3db0cafdf1
[Docs] Add Google Cloud Meetup ( #11864 )
2025-01-08 12:38:28 -08:00
rasmith
526de822d5
[Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models ( #11698 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-01-08 20:23:15 +00:00
Robert Shaw
56fe4c297c
[TPU][Quantization] TPU W8A8 ( #11785 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-08 19:33:29 +00:00
WangErXiao
47de8821d3
[Misc]add some explanations for BlockHashType ( #11847 )
2025-01-08 18:21:30 +00:00
Cyrus Leung
5984499e47
[Doc] Expand Multimodal API Reference ( #11852 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 17:14:14 +00:00
Cyrus Leung
ca47e176af
[Misc] Move some model utils into vision file ( #11848 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 17:04:46 +00:00
Yan Ma
78f4590b60
[Bugfix][XPU] fix silu_and_mul ( #11823 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-01-09 00:11:50 +08:00
Li, Jiang
2f7024987e
[CI/Build][Bugfix] Fix CPU CI image clean up ( #11836 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-08 15:18:28 +00:00
Cyrus Leung
6cd40a5bfe
[Doc][4/N] Reorganize API Reference ( #11843 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 21:34:44 +08:00
Harry Mellor
aba8d6ee00
[Doc] Move examples into categories ( #11840 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-08 13:09:53 +00:00
Cyrus Leung
2a0596bc48
[VLM] Reorganize profiling/processing-related code ( #11812 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 18:59:58 +08:00
youkaichao
f12141170a
[torch.compile] consider relevant code in compilation cache ( #11614 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-08 10:46:43 +00:00
Wallas Henrique
cfd3219f58
[Hardware][Apple] Native support for macOS Apple Silicon ( #11696 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-08 16:35:49 +08:00
Simon Mo
a1b2b8606e
[Docs] Update sponsor name: 'Novita' to 'Novita AI' ( #11833 )
2025-01-07 23:05:46 -08:00
youkaichao
ad9f1aa679
[doc] update wheels url ( #11830 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-08 14:36:49 +08:00
youkaichao
889e662eae
[misc] improve memory profiling ( #11809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-08 06:36:03 +00:00
Cyrus Leung
ef68eb28d8
[Bug] Fix pickling of ModelConfig when RunAI Model Streamer is used ( #11825 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 13:40:09 +08:00
Simon Mo
259abd8953
[Docs] reorganize sponsorship page ( #11639 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-01-07 21:16:08 -08:00
Jee Jee Li
f645eb6954
[Bugfix] Add checks for LoRA and CPU offload ( #11810 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-08 13:08:48 +08:00
Ilya Lavrenov
f4923cb8bc
[OpenVINO] Fixed Docker.openvino build ( #11732 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2025-01-08 13:08:30 +08:00
Nishidha
b640b19cc0
Fixed docker build for ppc64le ( #11518 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-01-08 13:05:37 +08:00
WangErXiao
dc71af0a71
Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… ( #11824 )
2025-01-08 04:09:25 +00:00
Divakar Verma
4d29e91be8
[Misc] sort torch profiler table by kernel timing ( #11813 )
2025-01-08 10:57:04 +08:00
Cyrus Leung
91445c7bc8
[Bugfix] Fix image input for Pixtral-HF ( #11741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 10:17:16 +08:00
Harry Mellor
5950f555a1
[Doc] Group examples into categories ( #11782 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-08 09:20:12 +08:00
Jie Fu (傅杰)
a4e2b26856
[Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 ( #11794 )
2025-01-07 16:15:50 -08:00
sroy745
973f5dc581
[Doc]Add documentation for using EAGLE in vLLM ( #11417 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2025-01-07 19:19:12 +00:00
jiangjiadi
c994223d56
[Bugfix] update the prefix for qwen2 ( #11795 )
...
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com >
2025-01-07 18:36:34 +00:00
youkaichao
869579a702
[optimization] remove python function call for custom op ( #11750 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-07 17:04:28 +00:00
Cyrus Leung
c0efe92d8b
[Doc] Add note to gte-Qwen2 models ( #11808 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 21:50:58 +08:00
youkaichao
d9fa1c05ad
[doc] update how pip can install nightly wheels ( #11806 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-07 21:42:58 +08:00
Roger Wang
2de197bdd4
[V1] Support audio language models on V1 ( #11733 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 19:47:36 +08:00
youkaichao
869e829b85
[doc] add doc to explain how to use uv ( #11773 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-07 18:41:17 +08:00
Cyrus Leung
8f37be38eb
[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation ( #11800 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 18:25:02 +08:00
Roger Wang
8082ad7950
[V1][Doc] Update V1 support for LLaVa-NeXT-Video ( #11798 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 09:55:39 +00:00
Yuan
1e4ce295ae
[CI][CPU] adding build number to docker image name ( #11788 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2025-01-07 07:28:01 +00:00
Russell Bryant
ce1917fcf2
[Doc] Create a vulnerability management team ( #9925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-06 22:57:32 -08:00
XiaobingZhang
e512f76a89
fix init error for MessageQueue when n_local_reader is zero ( #11768 )
2025-01-07 06:12:48 +00:00
Liangfu Chen
898cdf033e
[CI] Fix neuron CI and run offline tests ( #11779 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-01-06 21:36:10 -08:00
Roger Wang
0f3f3c86ec
[Bugfix] Update attention interface in Whisper ( #11784 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 04:36:24 +00:00
Jee Jee Li
b278557935
[Kernel][LoRA]Punica prefill kernels fusion ( #11234 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Zhonghua Deng <abatom@163.com >
2025-01-07 04:01:39 +00:00
Cyrus Leung
8ceffbf315
[Doc][3/N] Reorganize Serving section ( #11766 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 11:20:01 +08:00
YiSheng5
d93d2d74fd
[XPU] Make pp group initilized for pipeline-parallelism ( #11648 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2025-01-07 11:09:58 +08:00
Cyrus Leung
d0169e1b0f
[Model] Future-proof Qwen2-Audio multi-modal processor ( #11776 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 11:05:17 +08:00
Cyrus Leung
08fb75c72e
[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) ( #11772 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 01:10:54 +00:00
Roger Wang
91b361ae89
[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision ( #11685 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 19:58:16 +00:00
Chen Zhang
e20c92bb61
[Kernel] Move attn_type to Attention.__init__() ( #11690 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-07 00:11:28 +08:00
Jee Jee Li
32c9eff2ff
[Bugfix][V1] Fix molmo text-only inputs ( #11676 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-06 15:22:25 +00:00
youkaichao
4ca5d40adc
[doc] explain how to add interleaving sliding window support ( #11771 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-06 21:57:44 +08:00
Roger Wang
9279b9f83d
[Bugfix] Fix max image size for LLaVA-Onevision ( #11769 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-06 13:48:53 +00:00
Cyrus Leung
ee77fdb5de
[Doc][2/N] Reorganize Models and Usage sections ( #11755 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 21:40:31 +08:00
Cyrus Leung
996357e480
[VLM] Separate out profiling-related logic ( #11746 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 16:02:21 +08:00
Suraj Deshmukh
2a622d704a
k8s-config: Update the secret to use stringData ( #11679 )
...
Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com >
2025-01-06 08:01:22 +00:00
Lucas Tucker
9c749713f6
[mypy] Forward pass function type hints in lora ( #11740 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2025-01-06 07:59:36 +00:00
Rui Qiao
022c5c6944
[V1] Refactor get_executor_cls ( #11754 )
2025-01-06 07:59:16 +00:00
Rui Qiao
f8fcca100b
[Misc] Fix typo for valid_tool_parses ( #11753 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-01-06 07:12:38 +00:00
Woosuk Kwon
06bfb51963
[V1] Add BlockTable class ( #11693 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-06 14:24:42 +09:00
Cody Yu
408e560015
[Bugfix] Remove block size constraint ( #11723 )
2025-01-06 12:49:55 +08:00
Cyrus Leung
402d378360
[Doc] [1/N] Reorganize Getting Started section ( #11645 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 02:18:33 +00:00
cennn
9e764e7b10
[distributed] remove pynccl's redundant change_state ( #11749 )
2025-01-06 09:05:48 +08:00
Robert Shaw
33fc1e2e86
[Frontend] Improve StreamingResponse Exception Handling ( #11752 )
2025-01-05 16:35:01 -05:00
Lancer
eba17173d3
fix: [doc] fix typo ( #11751 )
...
Co-authored-by: Lancer <maruixiang6688@gmail.com >
2025-01-06 00:48:16 +08:00
cennn
635b897246
[distributed] remove pynccl's redundant stream ( #11744 )
2025-01-05 23:09:11 +08:00
Lu Fang
4068f4b5b5
[MISC] Replace c10::optional with std::optional ( #11730 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-05 10:20:34 +09:00
Jee Jee Li
47831430cc
[Bugfix][V1] Fix test_kv_cache_utils.py ( #11738 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-04 16:07:59 +00:00
Cyrus Leung
65c08928c2
[Model] Remove unnecessary weight initialization logic ( #11736 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-04 23:46:21 +08:00
Cyrus Leung
ba214dffbe
[Bugfix] Fix precision error in LLaVA-NeXT ( #11735 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-04 23:45:57 +08:00
Cyrus Leung
eed11ebee9
[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision ( #11717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-04 11:40:53 +00:00
Yan Burman
300acb8347
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture ( #11233 )
...
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com >
Signed-off-by: Ido Asraff <idoa@atero.ai >
2025-01-04 14:50:16 +08:00
xcnick
d91457d529
[V1] Add kv cache utils tests. ( #11513 )
...
Signed-off-by: xcnick <xcnick0412@gmail.com >
2025-01-04 14:49:46 +08:00
Kunshang Ji
fbf2564554
[V1] Add RayExecutor support for AsyncLLM (api server) ( #11712 )
2025-01-04 06:41:31 +00:00
Alberto Ferrer
d1d49397e7
Update bnb.md with example for OpenAI ( #11718 )
2025-01-04 06:29:02 +00:00
Hust_YangXian
9c93636d84
Update tool_calling.md ( #11701 )
2025-01-04 06:16:30 +00:00
WangErXiao
e5d7ed0c53
[V1] log GPU blocks num for MultiprocExecutor ( #11656 )
2025-01-04 00:13:12 +00:00
Robert Shaw
ad0d567e1c
[V1] Chore: cruft removal ( #11724 )
2025-01-03 23:25:02 +00:00
Michael Goin
bf0d97d786
Update requirements-tpu.txt to support python 3.9 and 3.11 ( #11695 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-03 22:36:46 +00:00
Jee Jee Li
a655eb3025
[Misc]Add BNB quantization for Qwen2VL ( #11719 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-03 15:19:02 -07:00
Robert Shaw
1543914c04
[V1] Improve TP>1 Error Handling + Stack Trace ( #11721 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-03 21:29:11 +00:00
ZincCat
61fed92c7e
[Bugfix] Fix ColumnParallelLinearWithLoRA slice ( #11708 )
...
Signed-off-by: ZincCat <zincchloride@outlook.com >
2025-01-03 21:02:34 +00:00
Robert Shaw
80c751e7f6
[V1] Simplify Shutdown ( #11659 )
2025-01-03 17:25:38 +00:00
Aurick Qiao
e1a5c2f0a1
[Model] Whisper model implementation ( #11280 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
2025-01-03 16:39:19 +08:00
Kevin H. Luu
fd3a62a122
[perf-benchmark] Fix dependency for steps in benchmark pipeline ( #11710 )
2025-01-02 22:38:37 -08:00
Lu Fang
07064cb1d4
[Bugfix] Check chain_speculative_sampling before calling it ( #11673 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-02 16:58:56 -08:00
Sachin Varghese
2f1e8e8f54
Update default max_num_batch_tokens for chunked prefill ( #11694 )
2025-01-03 00:25:53 +00:00
Nathan Azrak
68d37809b9
[Misc] Minimum requirements for SageMaker compatibility ( #11576 )
2025-01-02 15:59:25 -08:00
wchen61
5dba257506
Resolve race conditions in Marlin kernel ( #11493 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-01-02 22:58:56 +00:00
bjmsong
187e32997c
[Bugfix] Change kv scaling factor by param json on nvidia gpu ( #11688 )
...
Signed-off-by: bjmsong <bjmsong@126.com >
Co-authored-by: bjmsong <bjmsong@126.com >
2025-01-02 21:11:39 +00:00
Woosuk Kwon
b55ed6ef8a
[V1][Minor] Optimize token_ids_cpu copy ( #11692 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-02 12:04:58 -07:00
Kathy Yu
2f385183f3
[Bugfix] Free cross attention block table for preempted-for-recompute sequence group. ( #10013 )
...
Signed-off-by: Kathy Yu <feiyangyu@google.com >
2025-01-02 10:28:09 -08:00
Chunyang Wen
84c35c374a
According to vllm.EngineArgs, the name should be distributed_executor_backend ( #11689 )
2025-01-02 18:14:16 +00:00
Cyrus Leung
8c38ee7007
[VLM] Merged multi-modal processor for LLaVA-NeXT ( #11682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-02 16:39:27 +00:00
Tobias Pitters
b6087a6bee
[mypy] Pass type checking in vllm/inputs ( #11680 )
...
Signed-off-by: Tobias Pitters <tobias.pitters@gmail.com >
2025-01-02 16:18:15 +00:00
Cyrus Leung
23c1b10a4c
[VLM][Bugfix] Multi-modal processor compatible with V1 multi-input ( #11674 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-02 17:00:00 +08:00
Cyrus Leung
a115ac46b5
[VLM] Move supported limits and max tokens to merged multi-modal processor ( #11669 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-01 15:44:42 +00:00
Woosuk Kwon
73001445fb
[V1] Implement Cascade Attention ( #11635 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-01 21:56:46 +09:00
Kazuhiro Serizawa
6d70198b17
[Doc] Fix typo ( #11666 )
...
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
2025-01-01 08:10:10 +00:00
Lu Fang
f962f426bc
[Misc] Replace space with - in the file names ( #11667 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-01 07:39:30 +00:00
Jee Jee Li
11d8a091c6
[Misc] Optimize Qwen2-VL LoRA test ( #11663 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-01 14:42:23 +08:00
Cyrus Leung
365801fedd
[VLM] Add max-count checking in data parser for single image models ( #11661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-31 22:15:21 -08:00
Joe Runde
4db72e57f6
[Bugfix][Refactor] Unify model management in frontend ( #11660 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-01-01 02:21:51 +00:00
Yihua Cheng
0c6f998554
[Benchmark] Add benchmark script for CPU offloading ( #11533 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: KuntaiDu <kuntai@uchicago.edu >
2025-01-01 00:10:55 +00:00
Roger Wang
e7c7c5e822
[V1][VLM] V1 support for selected single-image models. ( #11632 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-31 21:17:22 +00:00
Chen Zhang
8c3230d8c1
[V1] Simpify vision block hash for prefix caching by removing offset from hash ( #11646 )
2024-12-31 08:56:01 +00:00
sakunkun
2c5718809b
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. ( #11565 )
2024-12-31 06:29:04 +00:00
John Giorgi
82c49d3260
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) ( #6909 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 22:15:58 -08:00
Michael Goin
74fa1d123c
[Bugfix] Fix OpenAI parallel sampling when using xgrammar ( #11637 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-31 03:43:54 +00:00
Matthias Vogler
a2a40bcd0d
[Model][LoRA]LoRA support added for MolmoForCausalLM ( #11439 )
...
Signed-off-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 17:33:06 -08:00
Kevin H. Luu
ccb1aabcca
[benchmark] Remove dependency for H100 benchmark step ( #11572 )
2024-12-30 12:27:07 -08:00
whyiug
36e7670045
[Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel ( #11631 )
2024-12-30 18:51:04 +00:00
Robert Shaw
5886aa496e
[V1] [6/N] API Server: Better Shutdown ( #11586 )
2024-12-30 15:51:02 +00:00
Cyrus Leung
8d9b6721e7
[VLM] Abstract out multi-modal data parsing in merged processor ( #11620 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-30 15:01:35 +00:00
youkaichao
b12e87f942
[platforms] enable platform plugins ( #11602 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 20:24:45 +08:00
Li, Jiang
5dbf854553
[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels ( #11618 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-30 10:17:04 +00:00
Tyler Michael Smith
970d6d0776
[Build][Kernel] Update CUTLASS to v3.6.0 ( #11607 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-30 17:22:13 +08:00
Liangfu Chen
628ec6c17b
[Docker] bump up neuron sdk v2.21 ( #11593 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2024-12-30 13:46:14 +08:00
youkaichao
3682e33f9f
[v1] fix compilation cache ( #11598 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 04:24:12 +00:00
Michael Goin
0aa38d16f5
Remove print statement in DeepseekScalingRotaryEmbedding ( #11604 )
2024-12-29 20:16:46 +00:00
Kuntai Du
faef77c0d6
[Misc] KV cache transfer connector registry ( #11481 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2024-12-29 16:08:09 +00:00
youkaichao
dba4d9dec6
[v1][bugfix] fix cudagraph with inplace buffer assignment ( #11596 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-29 09:03:49 +00:00
Cyrus Leung
32b4c63f02
[Doc] Convert list tables to MyST ( #11594 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-29 15:56:22 +08:00
Robert Shaw
4fb8e329fd
[V1] [5/N] API Server: unify Detokenizer and EngineCore input ( #11545 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-28 20:51:57 +00:00
youkaichao
328841d002
[bugfix] interleaving sliding window for cohere2 model ( #11583 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-28 16:55:42 +00:00
Cyrus Leung
d427e5cfda
[Doc] Minor documentation fixes ( #11580 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-28 21:53:59 +08:00
Woosuk Kwon
42bb201fd6
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor ( #11581 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-28 13:33:12 +00:00
hj-wei
59d6bb4c86
[Hardware][AMD]: Replace HIPCC version with more precise ROCm version ( #11515 )
...
Signed-off-by: hjwei <hjwei_xd@163.com >
2024-12-28 11:17:35 +00:00
Roger Wang
b7dcc003dc
[Model] Remove hardcoded image tokens ids from Pixtral ( #11582 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-28 10:54:23 +00:00
Isotr0py
d34be24bb1
[Model] Support InternLM2 Reward models ( #11571 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-28 06:14:10 +00:00
Rajveer Bachkaniwala
b5cbe8eeb3
[Bugfix] Last token measurement fix ( #11376 )
...
Signed-off-by: rajveerb <46040700+rajveerb@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-12-28 11:34:46 +08:00
Robert Shaw
df04dffade
[V1] [4/N] API Server: ZMQ/MP Utilities ( #11541 )
2024-12-28 01:45:08 +00:00
Chen Zhang
a60731247f
[Doc] Update mllama example based on official doc ( #11567 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2024-12-28 00:31:10 +00:00
Selali
ac79799403
[Bugfix] Fix for ROCM compressed tensor support ( #11561 )
2024-12-27 20:12:11 +00:00
Isotr0py
dde1fa18c9
[Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix ( #11566 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-27 19:45:13 +00:00
Jee Jee Li
0240402c46
[Misc]Add BNB quantization for MolmoForCausalLM ( #11551 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-27 18:48:24 +00:00
ErezSC42
55509c2114
[MODEL] LoRA support for Jamba model ( #11209 )
...
Signed-off-by: Erez Schwartz <erezs@ai21.com >
2024-12-27 17:58:21 +00:00
Cyrus Leung
101418096f
[VLM] Support caching in merged multi-modal processor ( #11396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-27 17:22:48 +00:00
Chen1022
5ce4627a7e
[Doc] Add xgrammar in doc ( #11549 )
...
Signed-off-by: ccjincong <chenjincong11@gmail.com >
2024-12-27 13:05:10 +00:00
Cyrus Leung
7af553ea30
[Misc] Abstract the logic for reading and writing media content ( #11527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-27 19:21:23 +08:00
Jee Jee Li
2c9b8ea2b0
[Bugfix] Fix TeleChat2ForCausalLM weights mapper ( #11546 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-27 10:39:15 +00:00
AlexHe99
d003f3ea39
Update deploying_with_k8s.md with AMD ROCm GPU example ( #11465 )
...
Signed-off-by: Alex He <alehe@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-27 10:00:04 +00:00
Mengqing Cao
6c6f7fe8a8
[Platform] Move model arch check to platform ( #11503 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-12-27 08:45:25 +00:00
Robert Shaw
2339d59f92
[BugFix] Fix quantization for all other methods ( #11547 )
2024-12-26 22:23:29 -08:00
Robert Shaw
1b875a0ef3
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly ( #11534 )
2024-12-26 21:19:21 -08:00
youkaichao
eb881ed006
[misc] fix typing ( #11540 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-27 11:05:08 +08:00
Robert Shaw
46d4359450
[CI] Fix broken CI ( #11543 )
2024-12-26 18:49:16 -08:00
Woosuk Kwon
81b979f2a8
[V1] Fix yapf ( #11538 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-27 09:47:10 +09:00
Woosuk Kwon
371d04d39b
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling ( #11394 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-27 09:32:38 +09:00
Robert Shaw
0c0c2015c5
Update openai_compatible_server.md ( #11536 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-12-26 16:26:18 -08:00
Simon Mo
82d24f7aac
[Docs] Document Deepseek V3 support ( #11535 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-12-26 16:21:56 -08:00
Simon Mo
f49777ba62
Deepseek v3 ( #11502 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com >
2024-12-26 16:09:44 -08:00
Robert Shaw
55fb97f7bd
[2/N] API Server: Avoid ulimit footgun ( #11530 )
2024-12-26 23:43:05 +00:00
Michael Goin
2072924d14
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization ( #11523 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: HandH1998 <1335248067@qq.com >
2024-12-26 15:33:30 -08:00
Robert Shaw
720b10fdc6
[1/N] API Server (Remove Proxy) ( #11529 )
2024-12-26 23:03:43 +00:00
Isotr0py
b85a977822
[Doc] Add video example to openai client for multimodal ( #11521 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-26 17:31:29 +00:00
Cyrus Leung
eec906d811
[Misc] Add placeholder module ( #11501 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 13:12:51 +00:00
Jee Jee Li
f57ee5650d
[Model] Modify MolmoForCausalLM MLP ( #11510 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 13:12:05 +00:00
sroy745
dcb1a944d4
[V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler ( #10681 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 19:02:58 +09:00
Roger Wang
7492a36207
[Doc] Add QVQ and QwQ to the list of supported models ( #11509 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-26 09:44:32 +00:00
Jee Jee Li
aa25985bd1
[Misc][LoRA] Fix LoRA weight mapper ( #11495 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 15:52:48 +08:00
Lucas Tucker
dbeac95dbb
Mypy checking for vllm/compilation ( #11496 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-26 05:04:07 +00:00
Cyrus Leung
51a624bf02
[Misc] Move some multimodal utils to modality-specific modules ( #11494 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 04:23:20 +00:00
Cyrus Leung
6ad909fdda
[Doc] Improve GitHub links ( #11491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 14:49:26 -08:00
Cyrus Leung
b689ada91e
[Frontend] Enable decord to load video from base64 ( #11492 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 16:33:55 +00:00
Jiaxin Shan
fc601665eb
[Misc] Update disaggregation benchmark scripts and test logs ( #11456 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-25 06:58:48 +00:00
Rui Qiao
9832e5572a
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor ( #11472 )
2024-12-24 19:49:46 -08:00
Cyrus Leung
3f3e92e1f2
[Model] Automatic conversion of classification and reward models ( #11469 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 18:22:22 +00:00
Yuan Tang
409475a827
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 ( #11435 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-24 16:53:28 +00:00
Jee Jee Li
196c34b0ac
[Misc] Move weights mapper ( #11443 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 13:05:25 +00:00
Mengqing Cao
5c7963249d
[attn][tiny fix] fix attn backend in MultiHeadAttention ( #11463 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-12-24 12:39:36 +00:00
Ilya Lavrenov
461cde2080
[OpenVINO] Fixed installation conflicts ( #11458 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2024-12-24 11:38:21 +00:00
Isotr0py
7a5286cc04
[Bugfix][Hardware][CPU] Fix CPU input_positions creation for text-only inputs with mrope ( #11434 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-24 17:59:51 +08:00
Jee Jee Li
b1b1038fbd
[Bugfix] Fix Qwen2-VL LoRA weight loading ( #11430 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 09:56:10 +00:00
Cyrus Leung
9edca6bf8f
[Frontend] Online Pooling API ( #11457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 17:54:30 +08:00
dpxa
4f074fbf53
[Misc]Suppress irrelevant exception stack trace information when CUDA… ( #11438 )
...
Co-authored-by: shiquan <shiquan>
2024-12-24 08:43:39 +00:00
Rui Qiao
a491d6f535
[V1] TP Ray executor ( #11107 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-12-23 23:00:12 +00:00
Rafael Vasquez
32aa2059ad
[Docs] Convert rST to MyST (Markdown) ( #11145 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-23 22:35:38 +00:00
yansh97
94d545a1a1
[Doc] Fix typo in the help message of '--guided-decoding-backend' ( #11440 )
2024-12-23 20:20:44 +00:00
Michael Goin
60fb4f3bcf
[Bugfix] Add kv cache scales to gemma2.py ( #11269 )
2024-12-23 19:30:45 +00:00
Michael Goin
63afbe9215
[CI] Expand OpenAI test_chat.py guided decoding tests ( #11048 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 18:35:38 +00:00
Dipika Sikka
8cef6e02dc
[Misc] add w8a8 asym models ( #11075 )
2024-12-23 13:33:20 -05:00
Dipika Sikka
b866cdbd05
[Misc] Add assertion and helpful message for marlin24 compressed models ( #11388 )
2024-12-24 02:23:38 +08:00
Yuan Tang
2e726680b3
[Bugfix] torch nightly version in ROCm installation guide ( #11423 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-23 17:20:22 +00:00
Michael Goin
5bfb30a529
[Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF ( #11389 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 23:06:20 +08:00
Lucas Tucker
e51719ae72
mypy type checking for vllm/worker ( #11418 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-23 13:55:49 +00:00
youkaichao
f30581c518
[misc][perf] remove old code ( #11425 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-23 08:01:08 +00:00
Simon Mo
048fc57a0f
[CI] Unblock H100 Benchmark ( #11419 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-12-22 14:17:43 -08:00
Jason T. Greene
f1d1bf6288
[Bugfix] Fix fully sharded LoRAs with Mixtral ( #11390 )
...
Signed-off-by: Jason Greene <jason.greene@redhat.com >
2024-12-22 23:25:10 +08:00
youkaichao
72d9c316d3
[cd][release] fix race conditions ( #11407 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-22 00:39:11 -08:00
youkaichao
4a9139780a
[cd][release] add pypi index for every commit and nightly build ( #11404 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-12-21 23:53:44 -08:00
Roger Wang
29c748930e
[CI] Fix flaky entrypoint tests ( #11403 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-21 21:08:44 -08:00
Roger Wang
c2d1b075ba
[Bugfix] Fix issues for Pixtral-Large-Instruct-2411 ( #11393 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-21 10:15:03 +00:00
Ricky Xu
584f0ae40d
[V1] Make AsyncLLMEngine v1-v0 opaque ( #11383 )
...
Signed-off-by: Ricky Xu <xuchen727@hotmail.com >
2024-12-21 15:14:08 +08:00
George
51ff216d85
[Bugfix] update should_ignore_layer ( #11354 )
...
Signed-off-by: George Ohashi <george@neuralmagic.com >
2024-12-21 06:36:23 +00:00
Woosuk Kwon
dd2b5633dd
[V1][Bugfix] Skip hashing empty or None mm_data ( #11386 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 14:22:21 +09:00
Jiaxin Shan
47a0b615b4
Add ray[default] to wget to run distributed inference out of box ( #11265 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-20 13:54:55 -08:00
youkaichao
5d2248d81a
[doc] explain nccl requirements for rlhf ( #11381 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-20 13:00:56 -08:00
Michael Goin
d573aeadcc
[Bugfix] Don't log OpenAI field aliases as ignored ( #11378 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 19:03:50 +00:00
omer-dayan
995f56236b
[Core] Loading model from S3 using RunAI Model Streamer as optional loader ( #10192 )
...
Signed-off-by: OmerD <omer@run.ai >
2024-12-20 16:46:24 +00:00
Daniele
7c7aa37c69
[CI/Build] fix pre-compiled wheel install for exact tag ( #11373 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2024-12-21 00:14:40 +08:00
Roger Wang
04139ade59
[V1] Fix profiling for models with merged input processor ( #11370 )
...
Signed-off-by: ywang96 <ywang@roblox.com >
2024-12-20 12:04:21 +00:00
youkaichao
1ecc645b8f
[doc] backward compatibility for 0.6.4 ( #11359 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:33:53 -08:00
youkaichao
c954f21ac0
[misc] add early error message for custom ops ( #11355 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:18:25 -08:00
Wallas Henrique
86c2d8fd1c
[Bugfix] Fix spec decoding when seed is none in a batch ( #10863 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-12-20 05:15:31 +00:00
Michael Goin
b880ffb87e
[Misc] Add tqdm progress bar during graph capture ( #11349 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 04:35:18 +00:00
youkaichao
7801f56ed7
[ci][gh200] dockerfile clean up ( #11351 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: cenzhiyao <2523403608@qq.com >
2024-12-19 18:13:06 -08:00
Akash kaothalkar
48edab8041
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 ( #11331 )
...
Signed-off-by: Akash Kaothalkar <0052v2@linux.vnet.ibm.com >
2024-12-20 01:32:07 +00:00
Yuan
a985f7af9f
[CI] Adding CPU docker pipeline ( #11261 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-19 11:46:55 -08:00
yangzhibin
e461c262f0
[Misc] Remove unused vllm/block.py ( #11336 )
2024-12-19 17:54:24 +00:00
Isotr0py
276738ce0f
[Bugfix] Fix broken CPU compressed-tensors test ( #11338 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-19 17:37:31 +00:00
Cyrus Leung
cdf22afdda
[Misc] Clean up and consolidate LRUCache ( #11339 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-20 00:59:32 +08:00
Isotr0py
e24113a8fe
[Model] Refactor Qwen2-VL to use merged multimodal processor ( #11258 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 16:28:00 +00:00
Roger Wang
7379b3d4b2
[V1] Fix multimodal profiling for Molmo ( #11325 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-19 16:27:22 +00:00
Yehoshua Cohen
6c7f881541
[Model] Add JambaForSequenceClassification model ( #10860 )
...
Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 22:48:06 +08:00
Cyrus Leung
a0f7d53beb
[Bugfix] Cleanup Pixtral HF code ( #11333 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 13:22:00 +00:00
Yanyi Liu
5aef49806d
[Feature] Add load generation config from model ( #11164 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-19 10:50:38 +00:00
Varun Sundar Rabindranath
98356735ac
[misc] benchmark_throughput : Add LoRA ( #11267 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 15:43:16 +08:00
Rui Qiao
f26c4aeecb
[Misc] Optimize ray worker initialization time ( #11275 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-18 23:38:02 -08:00
Varun Sundar Rabindranath
8936316d58
[Kernel] Refactor Cutlass c3x ( #10049 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 07:00:18 +00:00
Cyrus Leung
6142ef0ada
[VLM] Merged multimodal processor for Qwen2-Audio ( #11303 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 06:14:17 +00:00
Chen Zhang
c6b0a7d3ba
[V1] Simplify prefix caching logic by removing num_evictable_computed_blocks ( #11310 )
2024-12-19 04:17:12 +00:00
Michael Goin
a30482f054
[CI] Expand test_guided_generate to test all backends ( #11313 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-19 04:00:38 +00:00
Travis Johnson
17ca964273
[Model] IBM Granite 3.1 ( #11307 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-19 11:27:24 +08:00
Tyler Michael Smith
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) ( #11311 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-19 02:43:30 +00:00
Alexander Matveev
fdea8ec167
[V1] VLM - enable processor cache by default ( #11305 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-18 18:54:46 -05:00
Joe Runde
ca5f54a9b9
[Bugfix] fix minicpmv test ( #11304 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-18 10:34:26 -08:00
Kunshang Ji
f954fe0e65
[FIX] update openai version ( #11287 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-12-18 10:17:05 -08:00
Simon Mo
362cff1eb3
[CI][Misc] Remove Github Action Release Workflow ( #11274 )
2024-12-18 10:16:53 -08:00
Isotr0py
996aa70f00
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests ( #11263 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-18 10:16:40 -08:00
Dipika Sikka
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support ( #10995 )
...
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com >
Co-authored-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-18 09:57:16 -05:00
Yan Ma
f04e407e6b
[MISC][XPU]update ipex link for CI fix ( #11278 )
2024-12-17 22:34:23 -08:00
Wallas Henrique
8b79f9e107
[Bugfix] Fix guided decoding with tokenizer mode mistral ( #11046 )
2024-12-17 22:34:08 -08:00
Konrad Zawora
866fa4550d
[Bugfix] Restore support for larger block sizes ( #11259 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-17 16:39:07 -08:00
Cody Yu
bf8717ebae
[V1] Prefix caching for vision language models ( #11187 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-17 16:37:59 -08:00
Michael Goin
c77eb8a33c
[Bugfix] Set temperature=0.7 in test_guided_choice_chat ( #11264 )
2024-12-17 16:34:06 -08:00
Joe Runde
2d1b9baa8f
[Bugfix] Fix request cancellation without polling ( #11190 )
2024-12-17 12:26:32 -08:00
Isotr0py
f9ecbb18bf
[Misc] Allow passing logits_soft_cap for xformers backend ( #11252 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-17 00:37:04 -08:00
Roger Wang
02222a0256
[Misc] Kernel Benchmark for RMSNorm ( #11241 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com >
2024-12-17 06:57:02 +00:00
Tyler Michael Smith
2bfdbf2a36
[V1][Core] Use weakref.finalize instead of atexit ( #11242 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 22:11:33 -08:00
wangxiyuan
e88db68cf5
[Platform] platform agnostic for EngineArgs initialization ( #11225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-16 22:11:06 -08:00
Roger Wang
59c9b6ebeb
[V1][VLM] Proper memory profiling for image language models ( #11210 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-16 22:10:57 -08:00
kYLe
66d4b16724
[Frontend] Add OpenAI API support for input_audio ( #11027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-16 22:09:58 -08:00
Michael Goin
0064f697d3
[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse ( #10935 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-17 11:39:58 +08:00
youkaichao
35bae114a8
fix gh200 tests on main ( #11246 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 17:22:38 -08:00
youkaichao
88a412ed3d
[torch.compile] fast inductor ( #11108 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 16:15:22 -08:00
youkaichao
c301616ed2
[ci][tests] add gh200 tests ( #11244 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 15:53:18 -08:00
bk-TurbaAI
35ffa682b1
[Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving ( #11235 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-16 22:20:39 +00:00
youkaichao
551603feff
[core] overhaul memory profiling and fix backward compatibility ( #10511 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 13:32:25 -08:00
Varun Sundar Rabindranath
efbce85f4d
[misc] Layerwise profile updates ( #10242 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-16 18:14:57 +00:00
Isotr0py
2ca830dbaa
[Doc] Reorder vision language examples in alphabet order ( #11228 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-16 11:23:33 +00:00
Isotr0py
d927dbcd88
[Model] Refactor Ultravox to use merged input processor ( #11198 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-16 10:09:53 +00:00
Jani Monoses
bddbbcb132
[Model] Support Cohere2ForCausalLM (Cohere R7B) ( #11203 )
2024-12-16 09:56:19 +00:00
cennn
b3b1526f03
WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 ( #11212 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
2024-12-16 09:20:49 +00:00
yansh97
17138af7c4
[Bugfix] Fix the default value for temperature in ChatCompletionRequest ( #11219 )
2024-12-16 00:15:40 -08:00
chenqianfzh
69ba344de8
[Bugfix] Fix block size validation ( #10938 )
2024-12-15 16:38:40 -08:00
AlexHe99
da6f409246
Update deploying_with_k8s.rst ( #10922 )
2024-12-15 16:33:58 -08:00
Woosuk Kwon
25ebed2f8c
[V1][Minor] Cache np arange to reduce input preparation overhead ( #11214 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-15 13:33:00 -08:00
shangmingc
d263bd9df7
[Core] Support disaggregated prefill with Mooncake Transfer Engine ( #10884 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2024-12-15 21:28:18 +00:00
Kuntai Du
38e599d6a8
[Doc] add documentation for disaggregated prefilling ( #11197 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2024-12-15 13:31:16 -06:00
Cyrus Leung
96d673e0f8
[Bugfix] Fix error handling of unsupported sliding window ( #11213 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 10:59:42 -07:00
Cyrus Leung
b10609e6a1
[Misc] Clean up multi-modal processor ( #11207 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 06:30:28 +00:00
youkaichao
a1c02058ba
[torch.compile] allow tracking forward time ( #11081 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-14 19:45:00 -08:00
Jee Jee Li
15859f2357
[Misc] Upgrade bitsandbytes to the latest version 0.45.0 ( #11201 )
2024-12-15 03:03:06 +00:00
Sungjae Lee
886936837c
[Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion ( #7209 )
2024-12-14 11:38:10 -08:00
Mark McLoughlin
6d917d0eeb
Enable mypy checking on V1 code ( #11105 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2024-12-14 09:54:04 -08:00
Cyrus Leung
93abf23a64
[VLM] Fully dynamic prompt replacement in merged input processor ( #11199 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 17:52:18 +00:00
Brad Hilton
9c3dadd1c9
[Frontend] Add logits_processors as an extra completion argument ( #11150 )
...
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com >
2024-12-14 16:46:42 +00:00
Jee Jee Li
3cb5769883
[Misc] Minor improvements to the readability of PunicaWrapperBase ( #11200 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-14 16:38:27 +00:00
Tyler Michael Smith
ea7bd68d10
[V1][Bugfix] Fix V1 TP trust-remote-code ( #11182 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-14 08:21:23 +00:00
Russell Bryant
48259264a4
[Core] Update outlines and increase its threadpool size ( #11140 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-14 07:46:18 +00:00
dhuangnm
24a3d12b82
update compressed-tensors to latest version ( #11183 )
...
Co-authored-by: dhuangnm <dhuang@MacBook-Pro-2.local >
2024-12-14 03:22:44 +00:00
Cody Yu
9855aea21b
[Bugfix][V1] Re-compute an entire block when fully cache hit ( #11186 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 17:08:23 -08:00
Tyler Michael Smith
4b5b8a6a3b
[V1][Bugfix] Fix EngineCoreProc profile ( #11185 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-13 17:02:35 -08:00
Russell Bryant
4863e5fba5
[Core] V1: Use multiprocessing by default ( #11074 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-13 16:27:32 -08:00
Jiaxin Shan
0d8451c3a4
[Distributed] Allow the placement group more time to wait for resources to be ready ( #11138 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-13 20:17:37 +00:00
Jani Monoses
0a56bcc03d
[Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend ( #11169 )
2024-12-13 18:00:40 +00:00
Cyrus Leung
0920ab9131
[Doc] Reorganize online pooling APIs ( #11172 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 00:22:22 +08:00
Alexander Matveev
238c0d93b4
[Misc] Add tokenizer_mode param to benchmark_serving.py ( #11174 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-13 16:19:10 +00:00
zhangjf
5b0ed8391d
[Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor ( #11156 )
2024-12-13 15:56:19 +00:00
Sungjae Lee
c31d4a57a6
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching ( #8240 )
2024-12-13 07:51:25 -08:00
Chenguang Li
d1fa714cb1
[Refactor]A simple device-related refactor ( #11163 )
...
Signed-off-by: noemotiovon <noemotiovon@gmail.com >
Co-authored-by: noemotiovon <noemotiovon@gmail.com >
2024-12-13 13:39:00 +00:00
Roger Wang
969da7d70b
[V1][VLM] Fix edge case bug for InternVL2 ( #11165 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-13 11:09:30 +00:00
Cyrus Leung
eeec9e3390
[Frontend] Separate pooling APIs in offline inference ( #11129 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-13 10:40:07 +00:00
Li, Jiang
f93bf2b189
[Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt ( #11159 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-13 08:50:35 +00:00
Jani Monoses
7cd7409142
PaliGemma 2 support ( #11142 )
2024-12-13 07:40:07 +00:00
youkaichao
be39e3cd18
[core] clean up cudagraph batchsize padding logic ( #10996 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-13 06:57:50 +00:00
Cody Yu
34f1a806d5
[Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' ( #11157 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 06:30:06 +00:00
Gregory Shtrasberg
00c1bde5d8
[ROCm][AMD] Disable auto enabling chunked prefill on ROCm ( #11146 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-13 05:31:26 +00:00
Dipika Sikka
3989a79824
[Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization ( #11148 )
2024-12-13 05:07:20 +00:00
Pooya Davoodi
1efce68605
[Bugfix] Use runner_type instead of task in GritLM ( #11144 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-13 04:09:53 +00:00
Luka Govedič
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion ( #10906 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-13 03:19:23 +00:00
Cody Yu
78ed8f57d8
[Misc][V1] Fix type in v1 prefix caching ( #11151 )
2024-12-13 00:57:40 +00:00
shangmingc
db6c264a1e
[Bugfix] Fix value unpack error of simple connector for KVCache transfer. ( #11058 )
...
Signed-off-by: ShangmingCai <csmthu@gmail.com >
2024-12-12 21:19:17 +00:00
Jeremy Arnold
9f3974a319
Fix logging of the vLLM Config ( #11143 )
2024-12-12 12:05:57 -08:00
Cody Yu
2c97eca1ff
[Misc] Validate grammar and fail early ( #11119 )
2024-12-12 18:34:26 +00:00
Jeff Cook
5d712571af
[Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c7ae. ( #11024 )
2024-12-12 18:09:20 +00:00
Ramon Ziai
d4d5291cc2
fix(docs): typo in helm install instructions ( #11141 )
...
Signed-off-by: Ramon Ziai <ramon.ziai@bettermarks.com >
2024-12-12 17:36:32 +00:00
Roger Wang
4816d20aa4
[V1] Fix torch profiling for offline inference ( #11125 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-12 15:51:53 +00:00
Jiaxin Shan
85362f028c
[Misc][LoRA] Ensure Lora Adapter requests return adapter name ( #11094 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-12 09:25:16 +00:00
youkaichao
62de37a38e
[core][distributed] initialization from StatelessProcessGroup ( #10986 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-12 09:04:19 +00:00
Sanju C Sudhakaran
8195824206
[Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) ( #10565 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2024-12-12 08:09:28 +00:00
Woosuk Kwon
f092153fbe
[V1] Use more persistent buffers to optimize input preparation overheads ( #11111 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 23:14:20 -08:00
Pooya Davoodi
1da8f0e1dd
[Model] Add support for embedding model GritLM ( #10816 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-12 06:39:16 +00:00
Russell Bryant
ccede2b264
[Core] cleanup zmq ipc sockets on exit ( #11115 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 19:12:24 -08:00
Yuan Tang
24a36d6d5f
Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst ( #11112 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-12 02:39:21 +00:00
Simon Mo
8fb26dac61
[Docs] Add media kit ( #11121 )
2024-12-11 17:33:11 -08:00
Clayton
7439a8b5fc
[Bugfix] Multiple fixes to tool streaming with hermes and mistral ( #10979 )
...
Signed-off-by: cedonley <clayton@donley.io >
2024-12-12 01:10:12 +00:00
Alexander Matveev
4e11683368
[V1] VLM preprocessor hashing ( #11020 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-12 00:55:30 +00:00
Tyler Michael Smith
452a723bf2
[V1][Core] Remove should_shutdown to simplify core process termination ( #11113 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 23:34:54 +00:00
Cyrus Leung
d1e21a979b
[CI/Build] Split up VLM tests ( #11083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-12 06:18:16 +08:00
Rui Qiao
72ff3a9686
[core] Bump ray to use _overlap_gpu_communication in compiled graph tests ( #10410 )
...
Signed-off-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
2024-12-11 11:36:35 -08:00
youkaichao
66aaa7722d
[torch.compile] remove graph logging in ci ( #11110 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:59:50 -08:00
Woosuk Kwon
d643c2aba1
[V1] Use input_ids as input for text-only models ( #11032 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 10:49:23 -08:00
youkaichao
91642db952
[torch.compile] use depyf to dump torch.compile internals ( #10972 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:43:05 -08:00
bingps
fd22220687
[Doc] Installed version of llmcompressor for int8/fp8 quantization ( #11103 )
...
Signed-off-by: Guangda Liu <bingps@users.noreply.github.com >
Co-authored-by: Guangda Liu <bingps@users.noreply.github.com >
2024-12-11 15:43:24 +00:00
hissu-hyvarinen
b2f775456e
[CI/Build] Enable prefix caching test for AMD ( #11098 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-12-11 15:23:37 +00:00
Cyrus Leung
cad5c0a6ed
[Doc] Update docs to refer to pooling models ( #11093 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 13:36:27 +00:00
Cyrus Leung
8f10d5e393
[Misc] Split up pooling tasks ( #10820 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 01:28:00 -08:00
Rafael Vasquez
40766ca1b8
[Bugfix]: Clamp -inf logprob values in prompt_logprobs ( #11073 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-11 01:27:39 -08:00
B-201
2e32f5d28d
[Bugfix] Fix Idefics3 fails during multi-image inference ( #11080 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-12-11 01:27:07 -08:00
Russell Bryant
61b1d2f6ae
[Core] v1: Use atexit to handle engine core client shutdown ( #11076 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 01:26:36 -08:00
Kevin H. Luu
9974fca047
[ci/build] Fix entrypoints test and pin outlines version ( #11088 )
2024-12-11 01:01:53 -08:00
Kevin H. Luu
3fb4b4f163
[ci/build] Fix AMD CI dependencies ( #11087 )
2024-12-11 00:39:53 -08:00
Cyrus Leung
2e33fe4191
[CI/Build] Check transformers v4.47 ( #10991 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 05:02:02 +00:00
Maximilien de Bayser
e39400a4b6
Fix streaming for granite tool call when <|tool_call|> is present ( #11069 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-11 04:51:40 +00:00
Mor Zusman
ffa48c9146
[Model] PP support for Mamba-like models ( #10992 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-12-10 21:53:37 -05:00
Aurick Qiao
d5c5154fcf
[Misc] LoRA + Chunked Prefill ( #9057 )
2024-12-11 10:09:20 +08:00
Tyler Michael Smith
9a93973708
[Bugfix] Fix Mamba multistep ( #11071 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 00:16:22 +00:00
Woosuk Kwon
134810b3d9
[V1][Bugfix] Always set enable_chunked_prefill = True for V1 ( #11061 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-10 14:41:23 -08:00
youkaichao
75f89dc44c
[torch.compile] add a flag to track batchsize statistics ( #11059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-10 12:40:52 -08:00
Russell Bryant
e739194926
[Core] Update to outlines >= 0.1.8 ( #10576 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-10 12:08:16 -08:00
Flávia Béo
250ee65d72
[BUG] Remove token param #10921 ( #11022 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-12-10 17:38:15 +00:00
Joe Runde
9b9cef3145
[Bugfix] Backport request id validation to v0 ( #11036 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 16:38:23 +00:00
Jee Jee Li
d05f88679b
[Misc][LoRA] Add PEFTHelper for LoRA ( #11003 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-10 11:12:01 +00:00
Travis Johnson
beb16b2c81
[Bugfix] Handle <|tool_call|> token in granite tool parser ( #11039 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-10 10:27:11 +00:00
Maxime Fournioux
fe2e10c71b
Add example of helm chart for vllm deployment on k8s ( #9199 )
...
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
2024-12-10 09:19:27 +00:00
Gene Der Su
82c73fd510
[Bugfix] cuda error running llama 3.2 ( #11047 )
2024-12-10 07:41:11 +00:00
Diego Marinho
bfd610430c
Update README.md ( #11034 )
2024-12-09 23:08:10 -08:00
Jeff Cook
e35879c276
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. ( #11043 )
2024-12-10 14:54:22 +08:00
youkaichao
ebf778061d
monitor metrics of tokens per step using cudagraph batchsizes ( #11031 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 22:35:36 -08:00
Tyler Michael Smith
28b3a1c7e5
[V1] Multiprocessing Tensor Parallel Support for v1 ( #9856 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-10 06:28:14 +00:00
Patrick von Platen
bc192a2b09
[Pixtral] Improve loading ( #11040 )
2024-12-10 06:09:32 +00:00
Joe Runde
980ad394a8
[Frontend] Use request id from header ( #10968 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 13:46:29 +08:00
Cyrus Leung
391d7b2763
[Bugfix] Fix usage of deprecated decorator ( #11025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-10 13:45:47 +08:00
Isotr0py
d1f6d1c8af
[Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba ( #10739 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-10 10:23:07 +08:00
Michael Goin
6d525288c1
[Docs] Add dedicated tool calling page to docs ( #10554 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-09 20:15:34 -05:00
Woosuk Kwon
6faec54505
[V1] Do not store None in self.generators ( #11038 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 15:08:19 -08:00
Richard Liu
5ed5d5f128
Build tpu image in release pipeline ( #10936 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-09 23:07:48 +00:00
Gregory Shtrasberg
b63ba84832
[ROCm][bugfix] speculative decoding worker class ( #11035 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-09 14:00:29 -08:00
xendo
9c6459e4cb
[Neuron] Upgrade neuron to 2.20.2 ( #11016 )
...
Signed-off-by: Jerzy Zagorski <jzagorsk@amazon.com >
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-12-09 13:53:24 -08:00
youkaichao
1a2f8fb828
[v1] fix use compile sizes ( #11000 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 13:47:24 -08:00
Konrad Zawora
cbcbdb1ceb
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version ( #11028 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-09 13:21:06 -08:00
Isotr0py
a811dd6608
[Model] merged input processor for Phi-3-Vision models ( #10977 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-09 12:55:10 -08:00
Jee Jee Li
ca871491ed
[Misc][LoRA] Abstract PunicaWrapper ( #10955 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-09 12:54:44 -08:00
Woosuk Kwon
3b61cb450d
[V1] Further reduce CPU overheads in flash-attn ( #10989 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 12:38:46 -08:00
Kevin H. Luu
edc4fa3188
[ci/build] Recompile CI dependencies list with Python 3.12 ( #11013 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-12-09 11:46:58 -08:00
Varun Sundar Rabindranath
25b79d9fd3
[V1] Input Batch Relocation ( #10962 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-09 09:33:41 -08:00
wangxiyuan
aea2fc38c3
[Platform] Move async output check to platform ( #10768 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-09 17:24:46 +00:00
Russell Bryant
e691b26f6f
[Core] Require xgrammar >= 0.1.6 ( #11021 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-09 16:44:27 +00:00
Roger Wang
c690357928
[V1] Fix Detokenizer loading in AsyncLLM ( #10997 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-09 16:27:10 +00:00
youkaichao
d1c2e15eb3
[torch.compile] add dynamo time tracking ( #11005 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 23:09:04 -08:00
Roger Wang
af7c4a92e6
[Doc][V1] Add V1 support column for multimodal models ( #10998 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 22:29:16 -08:00
youkaichao
46004e83a2
[misc] clean up and unify logging ( #10999 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 17:28:27 -08:00
youkaichao
43b05fa314
[torch.compile][misc] fix comments ( #10993 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:18:18 -08:00
Roger Wang
a11f326528
[V1] Initial support of multimodal models for V1 re-arch ( #10699 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 12:50:51 +00:00
youkaichao
fd57d2b534
[torch.compile] allow candidate compile sizes ( #10984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:05:21 +00:00
youkaichao
7be15d9356
[core][misc] remove use_dummy driver for _run_workers ( #10920 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 12:06:08 -08:00
youkaichao
1b62745b1d
[core][executor] simplify instance id ( #10976 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 09:33:45 -08:00
zhou fan
78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None ( #10928 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2024-12-08 01:21:18 +08:00
Cyrus Leung
c889d5888b
[Doc] Explicitly state that PP isn't compatible with speculative decoding yet ( #10975 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:20:49 +00:00
Cyrus Leung
39e227c7ae
[Model] Update multi-modal processor to support Mantis(LLaVA) model ( #10711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:10:05 +00:00
Cyrus Leung
1c768fe537
[Doc] Explicitly state that InternVL 2.5 is supported ( #10978 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 16:58:02 +00:00
Cyrus Leung
bf0e382e16
[Model] Composite weight loading for multimodal Qwen2 ( #10944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 07:22:52 -07:00
Isotr0py
b26b4cd03c
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation ( #10958 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-07 18:33:49 +08:00
Gregory Shtrasberg
f13cf9ad50
[Build] Fix for the Wswitch-bool clang warning ( #10060 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-07 09:03:44 +00:00
Cyrus Leung
955fa9533a
[3/N] Support and implement merged input processor for LLaVA model ( #10676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-07 00:50:58 -08:00
Jee Jee Li
acf092d348
[Bugfix] Fix test-pipeline.yaml ( #10973 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-07 12:08:54 +08:00
Russell Bryant
69d357ba12
[Core] Cleanup startup logging a bit ( #10961 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-07 02:30:23 +00:00
youkaichao
dcdc3fafe5
[ci] fix broken tests ( #10956 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:47 -08:00
youkaichao
c05cfb67da
[misc] fix typo ( #10960 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:20 -08:00
Sam Stoelinga
7406274041
[Doc] add KubeAI to serving integrations ( #10837 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-12-06 17:03:56 +00:00
Michael Goin
8b59631855
[Core] Support Lark grammars for XGrammar ( #10870 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-06 08:34:29 -07:00
youkaichao
a1887f2c96
[torch.compile] fix deprecated code ( #10948 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:01:23 +00:00
Cyrus Leung
222f5b082a
[CI/Build] Fix broken multimodal test ( #10950 )
2024-12-06 10:41:23 +00:00
youkaichao
b031a455a9
[torch.compile] add logging for compilation time ( #10941 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-06 10:07:15 +00:00
youkaichao
db87eb6c67
[torch.compile] use size tuning for specific sizes ( #10933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 20:30:41 -08:00
youkaichao
9743d64e4e
[ci][build] add tests for python only compilation ( #10915 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 08:54:47 -08:00
Konrad Zawora
a43065272f
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives ( #10897 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-05 08:47:46 -08:00
Isotr0py
998eeafe58
[CI/Build] Bump test transformers version ( #10106 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 16:05:52 +00:00
Jee Jee Li
571da8fc43
[Misc][LoRA] Clean up the function interface of Punica ( #10917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:22:28 +00:00
Travis Johnson
39c89e71a8
[Misc] Update llama 3.2 template to support system prompt with images ( #10901 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-05 05:54:06 +00:00
Jee Jee Li
1f958a7d52
[Bugfix] Fix BNB loader target_modules ( #10720 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:20:26 +08:00
Cyrus Leung
aa39a8e175
[Doc] Create a new "Usage" section ( #10827 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 11:19:35 +08:00
Michael Goin
8d370e91cb
[Bugfix] Fallback to outlines for complex json schemas ( #10899 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-05 11:14:06 +08:00
Kevin H. Luu
7883c2bbe7
[benchmark] Make H100 benchmark optional ( #10908 )
2024-12-04 17:02:17 -08:00
Woosuk Kwon
2a56e1264f
[V1] Fix when max_model_len is not divisible by block_size ( #10903 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-04 16:54:05 -08:00
Daniele
e4c34c23de
[CI/Build] improve python-only dev setup ( #9621 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-04 21:48:13 +00:00
Chendi.Xue
82eb5ea8f3
Benchmark serving structured output ( #10880 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-04 16:28:21 -05:00
Isotr0py
10398b4706
[Model] Consolidate ViTs attention implementation without mask ( #10893 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-04 18:11:08 +00:00
Xin Yang
01d079fd8e
[LoRA] Change lora_tokenizers capacity ( #10796 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-12-04 17:40:16 +00:00
Kevin H. Luu
c92acb9693
[ci/build] Update vLLM postmerge ECR repo ( #10887 )
2024-12-04 09:01:20 +00:00
jianzheng
8db957ee3a
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 ( #10854 )
...
Signed-off-by: jianzheng <57654625+o2363286@users.noreply.github.com >
2024-12-04 08:48:22 +00:00
Kevin H. Luu
c9ca4fce3f
[ci/build] Job to build and push release image ( #10877 )
2024-12-04 15:02:40 +08:00
Kevin H. Luu
fa2dea61df
[ci/build] Change queue name for Release jobs ( #10875 )
2024-12-04 15:02:16 +08:00
wangxiyuan
b5b647b084
Drop ROCm load format check ( #10767 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-04 04:32:21 +00:00
Tyler Michael Smith
d2bd88b122
[CI/Build] Replace mean with torch.all in test_pynccl.py ( #10876 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-04 03:23:21 +00:00
Chendi.Xue
381ac93bb5
[Benchmark] Benchmark structured output with datasets ( #10557 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2024-12-03 17:21:06 -07:00
Gregory Shtrasberg
a061fe601e
[Build][Bugfix] Using the correct type hint ( #10866 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-03 15:47:55 -05:00
tomeras91
7c32b6861e
[Frontend] correctly record prefill and decode time metrics ( #10853 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-12-03 19:13:31 +00:00
Michael Goin
7090c27bb2
[Bugfix] Only require XGrammar on x86 ( #10865 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-03 10:32:21 -08:00
Yan Ma
2f2cdc745a
[MISC][XPU] quick fix for XPU CI ( #10859 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-12-03 17:16:31 +00:00
Alexander Matveev
3bc94cab69
[V1] VLM - Run the mm_mapper preprocessor in the frontend process ( #10640 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-03 10:33:10 +00:00
Yang Zheng
f6084f6324
[Speculative Decoding] Move indices to device before filtering output ( #10850 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-12-03 17:01:39 +08:00
Aaron Pham
9323a3153b
[Core][Performance] Add XGrammar support for guided decoding and set it as default ( #10785 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-12-03 15:17:00 +08:00
Cyrus Leung
3257d449fa
[Misc] Remove deprecated names ( #10817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:52:57 +00:00
Russell Bryant
ef51831ee8
[Doc] Add github links for source code references ( #10672 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:46:07 +00:00
youkaichao
dc5ce861bf
[torch.compile] remove compilation_context and simplify code ( #10838 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 06:19:02 +00:00
youkaichao
21fe7b481a
[core][distributed] add pynccl broadcast ( #10843 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 04:53:23 +00:00
Jee Jee Li
a4cf256159
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug ( #10844 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-03 12:10:29 +08:00
zixuanzhang226
d746268e92
[Model] support bitsandbytes quantization with minicpm model ( #10842 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-12-03 03:06:41 +00:00
Michael Goin
4433195ab7
[Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts ( #10753 )
2024-12-03 02:26:15 +00:00
Isotr0py
4c05edb33a
[Model] Add TP and BNB quantization support to LlavaMultiModalProjector ( #10834 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-02 23:06:09 +00:00
Jani Monoses
9b14d978aa
Fix openvino on GPU ( #10793 )
2024-12-02 18:52:19 +00:00
Yan Ma
519cc6ca12
[Misc][XPU] Avoid torch compile for XPU platform ( #10747 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-02 17:53:55 +00:00
Jee Jee Li
b45f0d7946
[Misc][LoRA] Move the implementation of lora bias to punica.py ( #10829 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-02 17:53:36 +00:00
youkaichao
a4c4daf364
[misc] use out argument for flash attention ( #10822 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 10:50:10 +00:00
Cyrus Leung
e95f275f57
[CI/Build] Update mistral_common version for tests and docs ( #10825 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-02 10:26:10 +00:00
zhou fan
ef31eabc68
[Model]: add some tests for aria model ( #10770 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-02 05:36:36 +00:00
wangxiyuan
995a148575
[doc]Update config docstring ( #10732 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-02 04:14:45 +00:00
youkaichao
63a164172d
[misc] remove xverse modeling file ( #10814 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 03:27:13 +00:00
Maximilien de Bayser
e25810ae29
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill ( #10799 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-02 10:05:32 +08:00
Woosuk Kwon
073a4bd1c0
[Kernel] Use out arg in flash_attn_varlen_func ( #10811 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-01 17:55:39 -08:00
cduk
b7954776fd
[core] Avoid metrics log noise when idle - include speculative decodi… ( #10809 )
2024-12-02 01:49:48 +00:00
Isotr0py
b18c9bbaba
[Model] Add BNB support to Llava and Pixtral-HF ( #10795 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-02 01:31:09 +00:00
Kuntai Du
0590ec3fd9
[Core] Implement disagg prefill by StatelessProcessGroup ( #10502 )
...
This PR provides initial support for single-node disaggregated prefill in the 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2024-12-01 19:01:00 -06:00
Roger Wang
c11f172187
[Misc] Adding MMMU-Pro vision dataset to serving benchmark ( #10804 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-01 08:47:05 +00:00
youkaichao
169a0ff911
[doc] add warning about comparing hf and vllm outputs ( #10805 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-01 00:41:38 -08:00
Cyrus Leung
d2f058e76c
[Misc] Rename embedding classes to pooling ( #10801 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 14:36:51 +08:00
Cyrus Leung
f877a7d12a
[Misc] Improve type annotations for support_torch_compile ( #10763 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-30 17:48:35 -08:00
Cyrus Leung
133707123e
[Model] Replace embedding models with pooling adapter ( #10769 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 08:02:54 +08:00
wangxiyuan
7e4bbda573
[doc] format fix ( #10789 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-30 11:38:40 +00:00
Patrick von Platen
e7cfc4ef4c
[Interleaved ATTN] Support for Mistral-8B ( #10591 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-30 07:45:50 +00:00
Isotr0py
16ee07f22a
[Model] Refactor Molmo weights loading to use AutoWeightsLoader ( #10771 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-30 04:19:14 +00:00
Nicolò Lucchesi
40bc242579
[Bugfix] Fix OpenVino/Neuron driver_worker init ( #10779 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-30 12:07:13 +08:00
wangxiyuan
661175bc82
[platform] Add verify_quantization in platform. ( #10757 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-29 15:22:21 +00:00
Jee Jee Li
3132aac043
[Bugfix] Fix Idefics3 bug ( #10778 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-29 13:56:46 +00:00
wang.yuqi
c82b432d4a
[Misc] typo find in sampling_metadata.py ( #10740 )
2024-11-29 05:17:57 +00:00
Cyrus Leung
fa6ecb9aa7
[Model] Clean up MiniCPMV ( #10751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-29 04:47:06 +00:00
Isotr0py
c83919c7a6
[Model] Add Internlm2 LoRA support ( #5064 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-28 17:29:04 +00:00
Woosuk Kwon
98f47f2a40
[V1] Optimize the CPU overheads in FlashAttention custom op ( #10733 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 09:01:02 -08:00
Woosuk Kwon
8c1e77fb58
[Kernel] Update vllm-flash-attn version to reduce CPU overheads ( #10742 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 08:31:28 -08:00
sixgod
5fc5ce0fe4
[Model] Added GLM-4 series hf format model support vllm==0.6.4 ( #10561 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-28 14:53:31 +00:00
Richard Liu
3ed5e73146
[TPU] Update requirements-tpu ( #10726 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-28 02:30:48 -08:00
Woosuk Kwon
9a8bff0285
[Kernel] Update vllm-flash-attn version ( #10736 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 02:25:59 -08:00
Woosuk Kwon
a79b122400
[V1] Do not allocate beyond the max_model_len ( #10730 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 00:13:15 -08:00
Ricky Xu
d9b4b3f069
[Bug][CLI] Allow users to disable prefix caching explicitly ( #10724 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-27 23:59:28 -08:00
罗泽轩
278be671a3
[Doc] Update model in arch_overview.rst to match comment ( #10701 )
...
Signed-off-by: spacewander <spacewanderlzx@gmail.com >
2024-11-27 23:58:39 -08:00
zixuanzhang226
70dc14fbd0
[Model] support bitsandbytes quantization with minicpm3 model ( #10682 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-27 23:58:02 -08:00
youkaichao
cb4e1c3f3a
[misc] upgrade filelock version ( #10731 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 19:54:58 -08:00
tomeras91
395b1c7454
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server ( #10635 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-27 13:21:10 -08:00
Cyrus Leung
9b4b150395
[Bugfix] Ignore lm_head when loading embedding models ( #10719 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-27 19:05:29 +00:00
Mor Zusman
197b4484a3
[Bugfix][Mamba] Fix Multistep on Mamba-like models ( #10705 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-27 19:02:27 +00:00
Isotr0py
b98c62ba49
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint ( #10675 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-27 10:43:17 -08:00
youkaichao
c411def234
[torch.compile] fix shape specialization ( #10722 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 10:16:10 -08:00
youkaichao
308cc5e21e
[ci] fix slow tests ( #10698 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 09:26:14 -08:00
Roger Wang
9e0a147d50
[V1] Update interface for mistral-format Pixtral ( #10703 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 12:26:27 +00:00
Li, Jiang
418cb3b93f
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault ( #10700 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-27 11:55:38 +00:00
shunxing12345
1209261e93
[Model] Support telechat2 ( #10311 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-27 11:32:35 +00:00
Tyler Michael Smith
e2251109c7
[Kernel] Remove if-else with identical branches in marlin 2:4 ( #10687 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-26 22:55:32 -08:00
Jee Jee Li
15cc2a9f1a
[Misc]Further reduce BNB static variable ( #10597 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-26 22:54:12 -08:00
Kunshang Ji
e85250b1d1
[Hardware][Gaudi]add get_name method for HPUAttentionBackend ( #10667 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-26 22:49:40 -08:00
yansh97
cfb3bf25fb
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig ( #10657 )
2024-11-27 13:55:23 +08:00
jeongin601
1bf905ddaa
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. ( #10198 )
...
Signed-off-by: jeongin601 <0200angela@gmail.com >
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com >
2024-11-27 05:07:30 +00:00
Roger Wang
0a4d968500
[V1] Update interface for idefics3 ( #10680 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 10:04:01 +08:00
Chendi.Xue
0a71900bc9
Remove hard-dependencies of Speculative decode to CUDA workers ( #10587 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-26 17:57:11 -08:00
Roger Wang
2f0a0a17a4
[V1] Refactor model executable interface for multimodal models ( #10570 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-26 20:46:11 +00:00
Michael Goin
7576cd38df
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes ( #10642 )
2024-11-26 12:29:00 -08:00
Michael Goin
9a99273b48
[Bugfix] Fix using -O[0,3] with LLM entrypoint ( #10677 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-26 10:44:01 -08:00
Conroy Cheers
f5792c7c4a
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson ( #9735 )
...
Signed-off-by: Conroy Cheers <conroy@corncheese.org >
2024-11-26 10:26:28 -08:00
Murali Andoorveedu
db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill ( #10232 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Co-authored-by: Sourashis Roy <sroy@roblox.com >
2024-11-26 09:11:16 -08:00
Kunshang Ji
1f6584ee85
[V1] Enable profile for LLMEngine ( #10665 )
2024-11-26 10:36:45 +00:00
youkaichao
334d64d1e8
[ci] add vllm_test_utils ( #10659 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-26 00:20:04 -08:00
Cyrus Leung
940635343a
[Misc] Remove outdated init protocols ( #10655 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-26 14:55:00 +08:00
Sage Moore
9a88f89799
custom allreduce + torch.compile ( #10121 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-25 22:00:16 -08:00
Ricky Xu
519e8e4182
[v1] EngineArgs for better config handling for v1 ( #10382 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-25 21:09:43 -08:00
Sanket Kale
a6760f6456
[Feature] vLLM ARM Enablement for AARCH64 CPUs ( #9228 )
...
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-11-25 18:32:39 -08:00
youkaichao
45ac4ff270
[bugfix] fix aria model and add torch.compile ( #10645 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 18:32:09 -08:00
youkaichao
6e9ff050c8
[misc] do not read HOST_IP ( #10644 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 17:04:50 -08:00
Shane A
9db713a1dc
[Model] Add OLMo November 2024 model ( #10503 )
2024-11-25 17:26:40 -05:00
Cyrus Leung
1b583cfefa
[Doc] Fix typos in docs ( #10636 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 10:15:45 -08:00
Cyrus Leung
cf73f0c95e
[Model] Enable optional prefix when loading embedding models ( #10639 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 18:14:33 +00:00
zhou fan
b1d920531f
[Model]: Add support for Aria model ( #10514 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-25 18:10:55 +00:00
Simon Mo
452a4e80c3
[Docs] Add Snowflake Slides ( #10641 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-25 09:34:46 -08:00
Wallas Henrique
c27df94e1f
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices ( #9850 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-11-25 12:23:32 -05:00
Chauncey
d04b13a380
[Bug]: Authorization ignored when root_path is set ( #10606 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-25 16:21:41 +00:00
fzyzcjy
2b0879bfc2
Super tiny little typo fix ( #10633 )
2024-11-25 13:08:30 +00:00
Cyrus Leung
ed46f14321
[Model] Support is_causal HF config field for Qwen2 model ( #10621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 09:51:20 +00:00
youkaichao
05d1f8c9c6
[misc] move functions to config.py ( #10624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 09:27:30 +00:00
youkaichao
25d806e953
[misc] add torch.compile compatibility check ( #10618 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:40:08 -08:00
youkaichao
65813781a2
[torch.compile] add warning for unsupported models ( #10622 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:27:51 -08:00
Jee Jee Li
7c2134beda
[torch.compile] force inductor threads ( #10620 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 23:04:21 -08:00
Cyrus Leung
a30a605d21
[Doc] Add encoder-based models to Supported Models page ( #10616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 06:34:07 +00:00
youkaichao
571841b7fc
[torch.compile] support encoder based models ( #10613 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 05:24:33 +00:00
Mengqing Cao
7ea3cd7c3e
[Refactor][MISC] del redundant code in ParallelConfig.postinit ( #10614 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-25 05:14:56 +00:00
Maximilien de Bayser
214efc2c3c
Support Cross encoder models ( #10400 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-24 18:56:20 -08:00
Zhuohan Li
49628fe13e
[Doc] Update README.md with Ray Summit talk links ( #10610 )
2024-11-24 16:45:09 -08:00
youkaichao
e4fbb14414
[doc] update the code to add models ( #10603 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-24 11:21:40 -08:00
youkaichao
c055747867
[model][utils] add extract_layer_index utility function ( #10599 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-23 22:22:54 -08:00
youkaichao
eda2b3589c
Revert "Print running script to enhance CI log readability" ( #10601 )
2024-11-23 21:31:47 -08:00
Jee Jee Li
1c445dca51
[CI/Build] Print running script to enhance CI log readability ( #10594 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 03:57:13 +00:00
Jee Jee Li
1700c543a5
[Bugfix] Fix LoRA weight sharding ( #10450 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-23 17:23:17 -08:00
Jee Jee Li
17d8fc1806
[bugfix] Fix example/tensorize_vllm_model tests ( #10595 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-23 17:22:33 -08:00
Isotr0py
04668ebe7a
[Bugfix] Avoid import AttentionMetadata explicitly in Mllama ( #10593 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-23 18:12:20 +00:00
Nishidha
651f6c31ac
For ppc64le, disabled tests for now and addressed space issues ( #10538 )
2024-11-23 09:33:53 +00:00
JiHuazhong
86a44fb896
[Platforms] Refactor openvino code ( #10573 )
...
Signed-off-by: statelesshz <hzji210@gmail.com >
2024-11-22 22:23:12 -08:00
Isotr0py
4cfe5d2bca
[Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel ( #10541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 21:25:46 -08:00
Cyrus Leung
c8acd80548
[2/N] handling placeholders in merged multi-modal processor ( #10485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-22 21:25:09 -08:00
Ricky Xu
4634a89d18
Prefix Cache Aware Scheduling [1/n] ( #10128 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-22 21:15:55 -08:00
kliuae
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm ( #10254 )
2024-11-22 21:14:49 -08:00
Michael Goin
02a43f82a9
Update default max_num_batch_tokens for chunked prefill to 2048 ( #10544 )
2024-11-22 21:14:19 -08:00
Chen Wu
cfea9c04ef
[Model] Fix Baichuan BNB online quantization ( #10572 )
...
Signed-off-by: Chen Wu <cntryroa@gmail.com >
2024-11-22 21:13:59 -08:00
Varun Vinayak Shenoy
7d8ffb344f
[Bugfix] Internal Server Error when tool_choice is incorrect. ( #10567 )
...
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com >
2024-11-22 21:13:29 -08:00
youkaichao
4aba6e3d1a
[core] gemma2 full context length support ( #10584 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 20:13:54 -08:00
Tyler Michael Smith
978b39744b
[Misc] Add pynccl wrappers for all_gather and reduce_scatter ( #9432 )
2024-11-22 22:14:03 -05:00
Russell Bryant
ebda51968b
[Core] Fix broken log configuration ( #10458 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-23 10:23:51 +08:00
Travis Johnson
9195dbdbca
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use ( #10164 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-23 10:17:38 +08:00
youkaichao
d559979c54
[bugfix] fix cpu tests ( #10585 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 17:34:03 -08:00
Zhonghua Deng
d345f409b7
[V1] EngineCore supports profiling ( #10564 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2024-11-22 17:16:15 -08:00
Russell Bryant
28598f3939
[Core] remove temporary local variables in LLMEngine.__init__ ( #10577 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-22 16:22:53 -08:00
zixuanzhang226
948c859571
support bitsandbytes quantization with qwen model ( #10549 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-22 16:16:14 -08:00
Ricky Xu
97814fbf0f
[v1] Refactor KVCacheManager for more hash input than token ids ( #10507 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-22 23:27:25 +00:00
youkaichao
eebad39f26
[torch.compile] support all attention backends ( #10558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 14:04:42 -08:00
youkaichao
db100c5cde
[bugfix] fix full graph tests ( #10581 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 10:02:14 -08:00
Noam Gat
11fcf0e066
Remove token-adding chat embedding params ( #10551 )
...
Signed-off-by: Noam Gat <noamgat@gmail.com >
2024-11-21 23:59:47 -08:00
Isotr0py
b6374e09b0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel ( #9948 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 15:01:56 +08:00
youkaichao
a111d0151f
[platforms] absorb worker cls difference into platforms folder ( #10555 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-21 21:00:32 -08:00
Woosuk Kwon
446c7806b2
[Minor] Fix line-too-long ( #10563 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 19:40:40 -08:00
youkaichao
33e0a2540a
[9/N] torch.compile LLM usage ( #10552 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 19:13:31 -08:00
Simon Mo
aed074860a
[Benchmark] Add new H100 machine ( #10547 )
2024-11-21 18:27:20 -08:00
Michael Goin
9afa014552
Add small example to metrics.rst ( #10550 )
2024-11-21 23:43:43 +00:00
Woosuk Kwon
46fe9b46d8
[Minor] Revert change in offline inference example ( #10545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 21:28:16 +00:00
youkaichao
cf656f5a02
[misc] improve error message ( #10553 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 13:13:17 -08:00
Yunmeng
edec3385b6
[CI][Installation] Avoid uploading CUDA 11.8 wheel ( #10535 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-11-21 13:03:58 -08:00
Woosuk Kwon
f9310cbd0c
[V1] Fix Compilation config & Enable CUDA graph by default ( #10528 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 12:53:39 -08:00
youkaichao
7560ae5caf
[8/N] enable cli flag without a space ( #10529 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 12:30:42 -08:00
Cyrus Leung
e7a8341c7c
[Bugfix] Allow token ID-only inputs in Qwen2-Audio ( #10536 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-21 18:09:43 +00:00
Roger Wang
c51e397fe8
[Misc] Suppress duplicated logging regarding multimodal input pipeline ( #10530 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-21 09:21:31 -08:00
Jee Jee Li
2385b60d83
[Kernel] Register punica ops directly ( #10522 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-21 09:18:11 -08:00
Chauncey
da7e702c6f
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored ( #10180 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-21 16:24:32 +00:00
Xiaoyu Zhang
4d676f0852
[Bugfix] Embedding model pooling_type equals ALL and multi input's bug ( #10494 )
2024-11-21 14:40:02 +00:00
Isotr0py
d5ec121f95
[Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models ( #10518 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-21 14:20:08 +00:00
Wang, Yi
8a93a598d9
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len ( #10524 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
2024-11-21 11:15:36 +00:00
Alex Brooks
1cfde82ffd
[Model] Add Support for Multimodal Granite Models ( #10291 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-21 10:46:20 +00:00
Zhong Qishuai
f0e0238016
[Doc] fix a small typo in docstring of llama_tool_parser ( #10513 )
2024-11-21 09:05:23 +00:00
youkaichao
aaddce5d26
[platforms] improve error message for unspecified platforms ( #10520 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 23:07:56 -08:00
Cyrus Leung
3430857b64
[Misc] Increase default video fetch timeout ( #10495 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 23:06:42 -08:00
Luka Govedič
8b0fe06c89
[torch.compile] Inductor code caching fix ( #10273 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
2024-11-20 21:44:57 -08:00
Mengqing Cao
9d827170a3
[Platforms] Add device_type in Platform ( #10508 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-21 04:44:20 +00:00
Pavani Majety
6c1208d083
[Core] Add Sliding Window Support with Flashinfer ( #10462 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2024-11-20 19:56:47 -08:00
youkaichao
388ee3de66
[torch.compile] limit inductor threads and lazy import quant ( #10482 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 18:36:33 -08:00
Woosuk Kwon
2f77b6cfec
[TPU] Implement prefix caching for TPUs ( #10307 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-20 13:54:15 -08:00
Guillaume Calmettes
c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server ( #10463 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-20 16:42:21 -05:00
youkaichao
0cd3d9717e
[7/N] torch.compile, reduce compilation time ( #10460 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 11:20:38 -08:00
Simon Mo
5f1d6af2b6
[perf bench] H200 development ( #9768 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-20 11:06:56 -08:00
youkaichao
772a66732d
[platforms] restore xpu check for parallel config ( #10479 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 17:13:28 +00:00
Li, Jiang
63f1fde277
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU ( #10355 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-20 10:57:39 +00:00
Mengqing Cao
d5b28447e0
[Platforms] Refactor xpu code ( #10468 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-19 22:52:13 -08:00
Cyrus Leung
09dbf9ff16
[Bugfix] Handle conflicts between modern and legacy fields ( #10471 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 14:45:08 +08:00
Sky Lee
343041c4c4
[model] Reduce medusa weight ( #10454 )
...
Signed-off-by: skylee-01 <497627264@qq.com >
2024-11-20 06:05:55 +00:00
Kevin H. Luu
ed701ca963
[ci/build] Combine nightly and optional ( #10465 )
2024-11-19 21:36:03 -08:00
wchen61
7629a9c6e5
[CI/Build] Support compilation with local cutlass path ( #10423 ) ( #10424 )
2024-11-19 21:35:50 -08:00
Rafael Vasquez
709c9f1f25
[CI/Build] Add sphinx/rst linter for docs ( #10366 )
2024-11-19 21:35:31 -08:00
Cyrus Leung
b4be5a8adb
[Bugfix] Enforce no chunked prefill for embedding models ( #10470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 05:12:51 +00:00
Isotr0py
ad44437ba3
[Bugfix] Fix Mamba model initialization and MLP Speculator weights loading ( #10456 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-20 05:04:05 +00:00
Yanyi Liu
9e05252b46
[Misc] Add __setitem__ for LazyDict ( #10469 )
...
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
2024-11-20 04:44:57 +00:00
Lucas Wilkinson
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) ( #10464 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-19 19:40:33 -08:00
Alexei-V-Ivanov-AMD
d5b68aba2f
[CI/Build] Update Dockerfile.rocm ( #10434 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2024-11-19 17:19:59 -08:00
Maximilien de Bayser
a324d3a1a7
Change granite chat template to keep json list formatting for tool calls ( #10452 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2024-11-19 18:16:54 -07:00
ElizaWszola
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion ( #9766 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-19 13:31:12 -08:00
Russell Bryant
efa9084628
[Core] Avoid metrics log noise when idle ( #8868 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 21:05:25 +00:00
youkaichao
803f37eaaa
[6/N] torch.compile rollout to users ( #10437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:09:03 -08:00
Russell Bryant
fd9f124971
[Doc] fix link for page that was renamed ( #10455 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:48:30 -08:00
Manjul Mohan
1ea291a417
Fix: Build error seen on Power Architecture ( #10421 )
...
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Manjul Mohan manjul.mohan@ibm.com <manjulmohan@ltcd97-lp2.aus.stglabs.ibm.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: ismael-dm <ismaeldm99@gmail.com >
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Angus Wang <wangjadehao@gmail.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Ricky Xu <rickyx@anyscale.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:34:57 -08:00
Patrick von Platen
11fd7ea639
[Pixtral-Large] Pixtral actually has no bias in vision-lang adapter ( #10449 )
2024-11-19 17:33:06 +00:00
COSMOPlat
f028dff33d
[BugFix] Fix hermes tool parser output error stream arguments in some cases ( #10395 ) ( #10398 )
...
Signed-off-by: xiyuan lee <lixiyuan@haier.com >
2024-11-19 13:42:50 +00:00
Yuan
b4614656b8
[CI][CPU] adding numa node number as container name suffix ( #10441 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-19 13:16:43 +00:00
youkaichao
25f9c78961
[misc][plugin] improve plugin loading ( #10443 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:43:21 +00:00
Russell Bryant
5390d6664f
[Doc] Add the start of an arch overview page ( #10368 )
2024-11-19 09:52:11 +00:00
Jee Jee Li
382b6a4852
[Misc] Avoid misleading warning messages ( #10438 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 08:54:58 +00:00
Travis Johnson
272e31c0bd
[Bugfix] Guard for negative counter metrics to prevent crash ( #10430 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-19 04:57:10 +00:00
Michael Goin
74f8c2cf5f
Add openai.beta.chat.completions.parse example to structured_outputs.rst ( #10433 )
2024-11-19 04:37:46 +00:00
Mengqing Cao
8c1fb50705
[Platform][Refactor] Extract func get_default_attn_backend to Platform ( #10358 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-11-19 11:22:26 +08:00
Jee Jee Li
7eb719df13
[Bugfix]Fix Phi-3 BNB online quantization ( #10417 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 03:21:42 +00:00
Kevin H. Luu
284203f171
[ci/build] Have dependabot ignore all patch update ( #10436 )
...
We have too many dependencies and all patch updates can be a little noisy. This is to have dependabot ignore all patch version updates.
2024-11-19 01:04:25 +00:00
Ricky Xu
90a6c759ca
[misc] partial prefix & random input generation benchmark ( #9929 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-18 15:39:14 -08:00
youkaichao
2298e69b5f
[ci][bugfix] fix kernel tests ( #10431 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:29:37 -08:00
youkaichao
a03ea40792
[3/N][torch.compile] consolidate custom op logging ( #10399 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:14:59 -08:00
Lucas Wilkinson
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors ( #9855 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-18 12:59:29 -07:00
Angus Wang
c2170a5b39
[Kernel] Explicitly specify other value in tl.load calls ( #9014 )
...
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
2024-11-18 11:39:40 -08:00
Yan Ma
6b2d25efc7
[Hardware][XPU] AWQ/GPTQ support for xpu backend ( #10107 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-18 11:18:05 -07:00
Michael Goin
281cc4b3cd
[Model][Bugfix] Support TP for PixtralHF ViT ( #10405 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-18 10:04:14 -08:00
Andrew Nesbitt
4f686d139f
Fix open_collective value in FUNDING.yml ( #10426 )
...
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
2024-11-18 09:52:42 -08:00
ismael-dm
31894a2155
[Doc] Add documentation for Structured Outputs ( #9943 )
...
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
2024-11-18 09:52:12 -08:00
youkaichao
7851b45196
[5/N][torch.compile] torch.jit.script --> torch.compile ( #10406 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 23:20:06 +08:00
B-201
4186be8111
[Doc] Update doc for LoRA support in GLM-4V ( #10425 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 15:08:30 +00:00
Isotr0py
e7ebb662d7
[Model] Remove transformers attention porting in VITs ( #10414 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 21:45:21 +08:00
B-201
5be4e52b65
[Model][LoRA]LoRA support added for glm-4v ( #10418 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 12:57:10 +00:00
Maybewuss
01aae1cc68
[Model] Remove redundant softmax when using PoolingType.STEP ( #10415 )
2024-11-18 10:05:36 +00:00
lkchen
c7dec926f6
[VLM] Report multi_modal_placeholders in output ( #10407 )
...
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com >
2024-11-18 16:06:16 +08:00
youkaichao
51bb12d17b
[4/N][torch.compile] clean up set_torch_compile_backend ( #10401 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-17 23:57:20 -08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
47826cacf0
[Bugfix] Ignore ray reinit error when current platform is ROCm or XPU ( #10375 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-11-18 11:29:26 +08:00
Isotr0py
c4e464333e
[Misc] Add uninitialized params tracking for AutoWeightsLoader ( #10327 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 09:07:46 +08:00
wchen61
d1557e66d3
[Misc] Enhance offline_inference to support user-configurable paramet… ( #10392 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-17 11:32:40 +00:00
电脑星人
80d85c5d7b
[Bugfix] Fix mrope_position_delta in non-last prefill chunk ( #10403 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 08:50:24 +00:00
Kunshang Ji
76aab90ab6
[Hardware] [HPU]add mark_step for hpu ( #10239 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-17 00:44:44 -08:00
youkaichao
8d74b5aee9
[platforms] refactor cpu code ( #10402 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 23:14:23 -08:00
Isotr0py
cf349c4a97
[Bugfix][CPU] Fix CPU embedding runner with tensor parallel ( #10394 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-16 23:12:04 -08:00
Chendi.Xue
905d0f0af4
[CI/Build] Fix IDC hpu [Device not found] issue ( #10384 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-17 14:58:22 +08:00
Roger Wang
643ecf7b11
[V1] Refactor model executable interface for all text-only language models ( #10374 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-17 05:18:46 +00:00
youkaichao
4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg ( #10383 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 18:02:14 -08:00
Woosuk Kwon
661a34fd4f
[V1] Add code owners for V1 ( #10397 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-16 10:45:26 -08:00
电脑星人
361c29e174
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled ( #10388 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 02:10:00 +08:00
Sky Lee
b98d89efd4
[Misc] Medusa supports custom bias ( #10361 )
2024-11-16 16:33:01 +00:00
Jaehyun An
8b6725b0cf
[Misc] Update benchmark to support image_url file or http ( #10287 )
...
Signed-off-by: rbbang <anjaehyun87@gmail.com >
2024-11-16 18:15:40 +08:00
rasmith
1d75472626
[BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel ( #10385 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-16 09:55:05 +00:00
youkaichao
2f427c2d16
[misc][plugin] improve log messages ( #10386 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 01:23:20 -08:00
youkaichao
755b85359b
[doc] add doc for the plugin system ( #10372 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 21:46:27 -08:00
Cyrus Leung
32e46e000f
[Frontend] Automatic detection of chat content format from AST ( #9919 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-16 13:35:40 +08:00
Michael Green
4f168f69a3
[Docs] Misc updates to TPU installation instructions ( #10165 )
2024-11-15 13:26:17 -08:00
Russell Bryant
3e8d14d8a1
[Doc] Move PR template content to docs ( #10159 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:20:20 -08:00
Russell Bryant
a067f85e08
[Frontend] Add --version flag to CLI ( #10369 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:13:53 -08:00
Simon Mo
c76ac49d26
[Docs] Add Nebius as sponsors ( #10371 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 12:47:40 -08:00
Simon Mo
a6221a144a
[Misc] bump mistral common version ( #10367 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 09:48:07 -08:00
ElizaWszola
79ee45b428
[Misc] Bump up test_fused_moe tolerance ( #10364 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-15 16:31:18 +00:00
Guillaume Calmettes
691a3ec047
[Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer ( #10363 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-15 14:50:40 +00:00
youkaichao
3a763ba0c3
[core][misc] keep compatibility for old-style classes ( #10356 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 13:55:51 +00:00
shangmingc
f2056f726d
[Misc] Fix some help info of arg_utils to improve readability ( #10362 )
2024-11-15 12:40:30 +00:00
Jee Jee Li
1d65ec7eeb
[Bugfix] Fix fully sharded LoRA bug ( #10352 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-15 10:34:58 +00:00
Xin Yang
26908554b2
[Doc] Remove float32 choice from --lora-dtype ( #10348 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-15 10:22:57 +00:00
Cyrus Leung
b311efd0bd
[Misc] Fix import error in tensorizer tests and cleanup some code ( #10349 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 09:34:17 +00:00
wchen61
3d158cdc8d
Add default value to avoid Falcon crash ( #5363 ) ( #10347 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-15 08:52:20 +00:00
Simon Mo
02dbf30e9a
[Build] skip renaming files for release wheels pipeline ( #9671 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-14 23:31:52 -08:00
Cyrus Leung
2ac6d0e75b
[Misc] Consolidate pooler config overrides ( #10351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 06:59:00 +00:00
Sky Lee
2ec8827288
[Bugfix] Qwen-vl output is inconsistent in speculative decoding ( #10350 )
2024-11-15 05:40:10 +00:00
Cyrus Leung
b40cf6402e
[Model] Support Qwen2 embeddings and use tags to select model tests ( #10184 )
2024-11-14 20:23:09 -08:00
Tyler Michael Smith
2885ba0e24
[Misc] Change RedundantReshapesPass and FusionPass logging from info to debug ( #10308 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-15 02:44:26 +00:00
Luka Govedič
bf2ddc6610
[bugfix] Fix static asymmetric quantization case ( #10334 )
...
Signed-off-by: Daniël de Kok <me@danieldk.eu >
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Daniël de Kok <me@danieldk.eu >
2024-11-15 09:35:11 +08:00
Cyrus Leung
972112d82f
[Bugfix] Fix unable to load some models ( #10312 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 16:55:54 -08:00
Patrick von Platen
11cd1ae6ad
[Tool parsing] Improve / correct mistral tool parsing ( #10333 )
2024-11-15 00:42:49 +00:00
Zijin Xiao
554af9228d
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 ( #9583 )
...
Signed-off-by: xiaozijin <xiaozijin@bytedance.com >
2024-11-14 16:38:53 -08:00
Murali Andoorveedu
b2e0ad3b59
[Perf] Reduce peak memory usage of llama ( #10339 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2024-11-15 00:38:20 +00:00
Maximilien de Bayser
4a18fd14ba
Support Roberta embedding models ( #9387 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-14 21:23:29 +00:00
Woosuk Kwon
1dbae0329c
[Docs] Publish meetup slides ( #10331 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-14 16:19:38 +00:00
Cyrus Leung
675d603400
[CI/Build] Make shellcheck happy ( #10285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 09:47:53 +00:00
Isotr0py
03025c023f
[CI/Build] Fix CPU CI online inference timeout ( #10314 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 16:45:32 +08:00
youkaichao
29f3ef26a3
[ci][distributed] disable hanging tests ( #10317 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-14 00:23:39 -08:00
B-201
294bf467ba
[Model] Add BNB quantization support for Idefics3 ( #10310 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-14 06:31:44 +00:00
Guillaume Calmettes
52b48c1ead
[BugFix]: properly deserialize tool_calls iterator before processing by mistral-common when MistralTokenizer is used ( #9951 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-14 04:48:16 +00:00
Mike Depinet
f67ce05d0b
[Frontend] Pythonic tool parser ( #9859 )
...
Signed-off-by: Mike Depinet <mike@fixie.ai >
2024-11-14 04:14:34 +00:00
Russell Bryant
e0853b6508
[Misc] format.sh: Simplify tool_version_check ( #10305 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-14 11:12:35 +08:00
youkaichao
504ac53d18
[misc] error early for old-style class ( #10304 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-13 18:55:39 -08:00
Isotr0py
15bb8330aa
[Bugfix] Fix tensor parallel for qwen2 classification model ( #10297 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 10:54:59 +08:00
HoangCongDuc
ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel ( #10200 )
...
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com >
2024-11-13 09:56:39 -07:00
Cyrus Leung
0b8bb86bf1
[1/N] Initial prototype for multi-modal processor ( #10044 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-13 12:39:03 +00:00
Roger Wang
bb7991aa29
[V1] Add missing tokenizer options for Detokenizer ( #10288 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-13 11:02:56 +00:00
B-201
d909acf9fe
[Model][LoRA]LoRA support added for idefics3 ( #10281 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-13 17:25:59 +08:00
Pavani Majety
b6dde33019
[Core] Flashinfer - Remove advance step size restriction ( #10282 )
2024-11-13 16:29:32 +08:00
Austin Veselka
1b886aa104
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 ( #9944 )
...
Signed-off-by: FurtherAI <austin.veselka@lighton.ai >
Co-authored-by: FurtherAI <austin.veselka@lighton.ai >
2024-11-13 08:28:13 +00:00
电脑星人
3945c82346
[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions ( #10221 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-13 07:07:22 +00:00
Xin Yang
032fcf16ae
[Doc] Fix typo in arg_utils.py ( #10264 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-12 21:54:52 -08:00
Dipika Sikka
56a955e774
Bump to compressed-tensors v0.8.0 ( #10279 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2024-11-12 21:54:10 -08:00
Woosuk Kwon
bbd3e86926
[V1] Support VLMs with fine-grained scheduling ( #9871 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-13 04:53:13 +00:00
youkaichao
0d4ea3fb5c
[core][distributed] use tcp store directly ( #10275 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 17:36:08 -08:00
Woosuk Kwon
112fa0bbe5
[V1] Fix CI tests on V1 engine ( #10272 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 16:17:20 -08:00
youkaichao
377b74fe87
Revert "[ci][build] limit cmake version" ( #10271 )
2024-11-12 15:06:48 -08:00
youkaichao
18081451f9
[doc] improve debugging doc ( #10270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:43:52 -08:00
youkaichao
96ae0eaeb2
[doc] fix location of runllm widget ( #10266 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:34:39 -08:00
Woosuk Kwon
1f55e05713
[V1] Enable Inductor when using piecewise CUDA graphs ( #10268 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 13:39:56 -08:00
Umesh
8a06428c70
[LoRA] Adds support for bias in LoRA ( #5733 )
...
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com >
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com >
2024-11-12 11:08:40 -08:00
sroy745
b41fb9d3b1
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers ( #9982 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-12 10:53:57 -08:00
Woosuk Kwon
7c65527918
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest ( #10245 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 08:57:14 -08:00
zifeitong
47db6ec831
[Frontend] Add per-request number of cached token stats ( #10174 )
2024-11-12 16:42:28 +00:00
Jie Fu (傅杰)
176fcb1c71
[Bugfix] Fix QwenModel argument ( #10262 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2024-11-12 16:36:51 +00:00
Jee Jee Li
a838ba7254
[Misc]Fix Idefics3Model argument ( #10255 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 13:07:11 +00:00
Guillaume Calmettes
36c513a076
[BugFix] Do not raise a ValueError when tool_choice is set to the supported none option and tools are not defined. ( #10000 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-12 11:13:46 +00:00
Yuan
d201d41973
[CI][CPU]refactor CPU tests to allow to bind with different cores ( #10222 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-12 10:07:32 +00:00
youkaichao
3a28f18b0b
[doc] explain the class hierarchy in vLLM ( #10240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 22:56:44 -08:00
Aleksandr Malyshev
812c981fa0
Splitting attention kernel file ( #10091 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2024-11-11 22:55:07 -08:00
Jee Jee Li
7f5edb5900
[Misc][LoRA] Replace hardcoded cuda device with configurable argument ( #10223 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 11:10:15 +08:00
youkaichao
eea55cca5b
[1/N] torch.compile user interface design ( #10237 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 18:01:06 -08:00
Russell Bryant
9cdba9669c
[Doc] Update help text for --distributed-executor-backend ( #10231 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-12 09:55:09 +08:00
youkaichao
d1c6799b88
[doc] update debugging guide ( #10236 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 15:21:12 -08:00
Robert Shaw
6ace6fba2c
[V1] AsyncLLM Implementation ( #9826 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-11 23:05:38 +00:00
Nikolai Shcheglov
08f93e7439
Make shutil rename in python_only_dev ( #10233 )
...
Signed-off-by: shcheglovnd <shcheglovnd@avride.ai >
2024-11-11 14:29:19 -08:00
Woosuk Kwon
9d5b4e4dea
[V1] Enable custom ops with piecewise CUDA graphs ( #10228 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:58:07 -08:00
youkaichao
8a7fe47d32
[misc][distributed] auto port selection and disable tests ( #10226 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 11:54:59 -08:00
Yuan Tang
4800339c62
Add docs on serving with Llama Stack ( #10183 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-11 11:28:55 -08:00
Woosuk Kwon
fe15729a2b
[V1] Use custom ops for piecewise CUDA graphs ( #10227 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:26:48 -08:00
youkaichao
330e82d34a
[v1][torch.compile] support managing cudagraph buffer ( #10203 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:10:27 -08:00
Woosuk Kwon
d7a4f2207b
[V1] Do not use inductor for piecewise CUDA graphs ( #10225 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:05:57 -08:00
Woosuk Kwon
f9dadfbee3
[V1] Fix detokenizer ports ( #10224 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 10:42:07 -08:00
dependabot[bot]
25144ceed0
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #10209 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 17:24:10 +00:00
youkaichao
e6de9784d2
[core][distributed] add stateless process group ( #10216 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 09:02:14 -08:00
Yangcheng Li
36fc439de0
[Doc] fix doc string typo in block_manager swap_out function ( #10212 )
2024-11-11 08:53:07 -08:00
harrywu
874f551b36
[Metrics] add more metrics ( #4464 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-12 00:17:38 +08:00
Isotr0py
2cebda42bb
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner ( #10218 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 12:37:58 +00:00
Roger Wang
5fb1f935b0
[V1] Allow tokenizer_mode and trust_remote_code for Detokenizer ( #10211 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-11 18:01:18 +08:00
Jee Jee Li
36e4acd02a
[LoRA][Kernel] Remove the unused libentry module ( #10214 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-11 09:43:23 +00:00
Isotr0py
58170d6503
[Hardware][CPU] Add embedding models support for CPU backend ( #10193 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 08:54:28 +00:00
dependabot[bot]
9804ac7c7c
Bump the patch-update group with 5 updates ( #10210 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 07:22:40 +00:00
youkaichao
f89d18ff74
[6/N] pass whole config to inner model ( #10205 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 06:41:46 +00:00
youkaichao
f0f2e5638e
[doc] improve debugging code ( #10206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-10 17:49:40 -08:00
yansh97
ad9a78bf64
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py ( #10196 )
2024-11-11 00:14:22 +00:00
youkaichao
73b9083e99
[misc] improve cloudpickle registration and tests ( #10202 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 00:10:53 +00:00
Shawn Du
20cf2f553c
[Misc] small fixes to function tracing file path ( #9543 )
...
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 15:21:06 -08:00
Yongzao
bfb7d61a7c
[doc] Polish the integration with huggingface doc ( #10195 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 10:22:04 -08:00
FuryMartin
19682023b6
[Doc] Fix typo error in CONTRIBUTING.md ( #10190 )
...
Signed-off-by: FuryMartin <furymartin9910@outlook.com >
2024-11-10 07:47:24 +00:00
youkaichao
9fa4bdde9d
[ci][build] limit cmake version ( #10188 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 16:27:26 -08:00
Cyrus Leung
51c2e1fcef
[CI/Build] Split up models tests ( #10069 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:39:14 -08:00
Krishna Mandal
b09895a618
[Frontend][Core] Override HF config.json via CLI ( #5836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 16:19:27 +00:00
cjackal
d88bff1b96
[Frontend] add add_request_id middleware ( #9594 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-11-09 10:18:29 +00:00
Zhao Yingzhuo
9e37266420
bugfix: fix the bug that stream generate not work ( #2756 )
2024-11-09 10:09:48 +00:00
youkaichao
8a4358ecb5
[doc] explaining the integration with huggingface ( #10173 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 01:02:54 -08:00
youkaichao
bd46357ad9
[bugfix] fix broken tests of mlp speculator ( #10177 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 00:04:50 -08:00
bnellnm
f192aeba74
[Bugfix] Enable some fp8 and quantized fullgraph tests ( #10171 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-09 08:01:27 +00:00
Chendi.Xue
8e1529dc57
[CI/Build] Add run-hpu-test.sh script ( #10167 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2024-11-09 06:26:52 +00:00
youkaichao
1a95f10ee7
[5/N] pass the whole config to model ( #9983 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 14:17:28 +08:00
Cyrus Leung
49d2a41a86
[Doc] Adjust RunLLM location ( #10176 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 20:07:10 -08:00
Isotr0py
47672f38b5
[CI/Build] Fix VLM broadcast tests tensor_parallel_size passing ( #10161 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-09 04:02:59 +00:00
Michael Goin
f83feccd7f
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module ( #10169 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-09 03:36:46 +00:00
Cyrus Leung
e0191a95d8
[0/N] Rename MultiModalInputs to MultiModalKwargs ( #10040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:31:02 +08:00
Li, Jiang
d7edca1dee
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking ( #6892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 03:27:11 +00:00
rasmith
127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case ( #9857 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-08 19:59:22 -05:00
bnellnm
10b67d865d
[Bugfix] SymIntArrayRef expected to contain concrete integers ( #10170 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-08 14:44:18 -08:00
Luka Govedič
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
Florian Zimmermeister
e1b5a82179
Rename vllm.logging to vllm.logging_utils ( #10134 )
2024-11-08 20:53:24 +00:00
Luka Govedič
87713c6053
[CI/Build] Ignore .gitignored files for shellcheck ( #10162 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-11-08 19:53:36 +00:00
Woosuk Kwon
b5815c8413
[V1] Fix non-cudagraph op name ( #10166 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-08 10:23:04 -08:00
Rafael Vasquez
6b30471586
[Misc] Improve Web UI ( #10090 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-08 09:51:04 -08:00
sroy745
f6778620a9
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 ( #10136 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-08 15:56:18 +00:00
Patrick von Platen
0535e5fe6c
Fix edge case Mistral tokenizer ( #10152 )
2024-11-08 15:42:27 +00:00
Cyrus Leung
b489fc3c91
[CI/Build] Update CPU tests to include all "standard" tests ( #5481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 23:30:04 +08:00
Roger Wang
208ce622c7
[V1]Enable APC by default only for text models ( #10148 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-08 14:39:41 +00:00
Isotr0py
1ff4aed5bd
[Model] Expose size to Idefics3 as mm_processor_kwargs ( #10146 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-08 09:56:58 +00:00
Yan Ma
f10797c0ce
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator ( #10144 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-08 09:41:03 +00:00
Cyrus Leung
f4c2187e29
[Misc] Fix typo in #5895 ( #10145 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 09:07:01 +00:00
Michael Goin
aea6ad629f
Add hf_transfer to testing image ( #10096 )
2024-11-08 08:35:25 +00:00
Tao He
da07a9ead7
Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. ( #9285 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2024-11-08 05:31:28 +00:00
Russell Bryant
3a7f15a398
[Doc] Move CONTRIBUTING to docs site ( #9924 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 05:15:12 +00:00
Mengqing Cao
7371749d54
[Misc] Fix ImportError causing by triton ( #9493 )
2024-11-08 05:08:51 +00:00
DearPlanet
ad39bd640c
[Bugfix] Add error handling when server cannot respond any valid tokens ( #5895 )
2024-11-08 04:58:37 +00:00
whyiug
40d0e7411d
[Doc] Update FAQ links in spec_decode.rst ( #9662 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2024-11-08 04:44:58 +00:00
Russell Bryant
6bb52b0f97
[CI/Build] Give PR cleanup job PR write access ( #10139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 12:10:20 +08:00
Cody Yu
201fc07730
[V1] Prefix caching (take 2) ( #9972 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-07 17:34:44 -08:00
Woosuk Kwon
42b4f46b71
[V1] Add all_token_ids attribute to Request ( #10135 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-07 17:08:24 -08:00
Jiangtao Hu
073a472728
[Misc] report relevant env vars in collect_env.py tool ( #9293 )
2024-11-07 16:14:01 -08:00
dependabot[bot]
93bff421bc
Bump actions/checkout from 4.2.1 to 4.2.2 ( #9746 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 21:44:58 +00:00
litianjian
28b2877d30
Online video support for VLMs ( #10020 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 20:25:59 +00:00
dependabot[bot]
97b8475beb
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #9745 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 18:55:35 +00:00
Russell Bryant
a2f1f3b089
[CI/Build] Automate PR body text cleanup ( #10082 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:26:28 +00:00
Russell Bryant
3be5b26a76
[CI/Build] Add shell script linting using shellcheck ( #7925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:17:29 +00:00
Russell Bryant
de0e61a323
[CI/Build] Always run mypy ( #10122 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 16:43:16 +00:00
Nicolò Lucchesi
9d43afcc53
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding ( #9291 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2024-11-07 08:15:14 -08:00
Maximilien de Bayser
ae62fd17c0
[Frontend] Tool calling parser for Granite 3.0 models ( #9027 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 07:09:02 -08:00
Atlas
a62bc0109c
[Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. ( #10105 )
...
Signed-off-by: Mozhou <spli161006@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-11-07 11:20:30 +00:00
Jiahao Li
999df95b4e
[Bugfix] Make image processor respect mm_processor_kwargs for Qwen2-VL ( #10112 )
...
Signed-off-by: Jiahao Li <liplus17@163.com >
2024-11-07 10:50:44 +00:00
Li, Jiang
a6f332d0d9
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target ( #10108 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 18:42:50 +08:00
Lei Yang
0dfba97b42
[Frontend] Fix multiple values for keyword argument error ( #10075 ) ( #10076 )
...
Signed-off-by: Lei <ylxx@live.com >
2024-11-07 09:07:19 +00:00
Flávia Béo
aa9078fa03
Adds method to read the pooling types from model's files ( #9506 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 08:42:40 +00:00
Russell Bryant
e036e527a0
[CI/Build] Improve mypy + python version matrix ( #10041 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 07:54:16 +00:00
Hanzhi Zhou
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce ( #10030 )
...
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com >
2024-11-06 23:50:47 -08:00
Rafael Vasquez
d7263a1bb8
Doc: Improve benchmark documentation ( #9927 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-06 23:50:35 -08:00
Russell Bryant
104d729656
[CI/Build] re-add codespell to CI ( #10083 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:54:46 -08:00
Cyrus Leung
db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config ( #10104 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 06:00:21 +00:00
Nick Hill
1fa020c539
[V1][BugFix] Fix Generator construction in greedy + seed case ( #10097 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-07 05:06:57 +00:00
youkaichao
e7b84c394d
[doc] add back Python 3.8 ABI ( #10100 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 21:06:41 -08:00
Li, Jiang
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 ( #9911 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 04:43:08 +00:00
Nick Hill
29862b884b
[Frontend] Adjust try/except blocks in API impl ( #10056 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-06 20:07:51 -08:00
Yan Ma
d3859f1891
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend ( #9823 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-06 17:29:03 -08:00
Michael Goin
4ab3256644
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 ( #10095 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-07 00:54:13 +00:00
youkaichao
719c1ca468
[core][distributed] add stateless_init_process_group ( #10072 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 16:42:09 -08:00
Russell Bryant
74f2f8a0f1
[CI/Build] Always run the ruff workflow ( #10092 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:25:23 +00:00
Joe Runde
d58268c56a
[V1] Make v1 more testable ( #9888 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-06 11:57:35 -08:00
Russell Bryant
87bd7e0515
[CI/Build] change conflict PR comment from mergify ( #10080 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 10:15:42 -08:00
Russell Bryant
098f94de42
[CI/Build] Drop Python 3.8 support ( #10038 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 14:31:01 +00:00
Michael Goin
399c798608
Remove ScaledActivation for AWQ ( #10057 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-06 14:27:06 +00:00
Eric
406d4cc480
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration ( #10022 )
...
Signed-off-by: ericperfect <ericperfectttt@gmail.com >
2024-11-06 14:13:15 +00:00
Jee Jee Li
a5bba7d234
[Model] Add Idefics3 support ( #9767 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
2024-11-06 11:41:17 +00:00
Jee Jee Li
2003cc3513
[Model][LoRA]LoRA support added for LlamaEmbeddingModel ( #10071 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-06 09:49:19 +00:00
Woosuk Kwon
6a585a23d2
[Hotfix] Fix ruff errors ( #10073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-06 01:24:28 -08:00
Konrad Zawora
a02a50e6e5
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend ( #6143 )
...
Signed-off-by: yuwenzho <yuwen.zhou@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Signed-off-by: Bob Zhu <bob.zhu@intel.com >
Signed-off-by: zehao-intel <zehao.huang@intel.com >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
Co-authored-by: Marceli Fylcek <mfylcek@habana.ai >
Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com >
Co-authored-by: Vivek Goel <vgoel@habana.ai >
Co-authored-by: yuwenzho <yuwen.zhou@intel.com >
Co-authored-by: Dominika Olszewska <dolszewska@habana.ai >
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com >
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com >
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai >
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com >
Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai >
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com >
Co-authored-by: Ilia Taraban <tarabanil@gmail.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai >
Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai >
Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com >
Co-authored-by: Sun Choi <schoi@habana.ai >
Co-authored-by: Iryna Boiko <iboiko@habana.ai >
Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com >
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com >
Co-authored-by: Zehao Huang <zehao.huang@intel.com >
Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com >
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com >
Co-authored-by: Nir David <ndavid@habana.ai >
Co-authored-by: Yu-Zhou <yu.zhou@intel.com >
Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai >
Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com >
Co-authored-by: Jacek Czaja <jczaja@habana.ai >
Co-authored-by: Yuan <yuan.zhou@outlook.com >
2024-11-06 01:09:10 -08:00
Isotr0py
a5fda50a10
[CI/Build] Fix large_gpu_mark reason ( #10070 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-06 08:50:37 +00:00
Aaron Pham
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
youkaichao
4be3a45158
[distributed] add function to create ipc buffers directly ( #10064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 22:35:03 -08:00
Woosuk Kwon
4089985552
[V1] Integrate Piecewise CUDA graphs ( #10058 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-05 22:16:04 -08:00
zifeitong
9d59b75593
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type ( #10054 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-06 05:13:09 +00:00
arakowsk-amd
ea928f608c
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path ( #10063 )
...
Signed-off-by: Alex Rakowski <alex.rakowski@amd.com >
Signed-off-by: Alex Rakowski <182798202+arakowsk-amd@users.noreply.github.com >
2024-11-06 05:10:40 +00:00
Travis Johnson
2bcbae704c
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer ( #10051 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-06 04:28:29 +00:00
Peter Salas
ffc0f2b47a
[Model][OpenVINO] Fix regressions from #8346 ( #10045 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-06 04:19:15 +00:00
Cyrus Leung
82bfc38d07
[Misc] Sort the list of embedding models ( #10037 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 04:05:05 +00:00
youkaichao
c4cacbaa7f
[v1] reduce graph capture time for piecewise cudagraph ( #10059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 18:19:50 -08:00
Sungjae Lee
0c63c34f72
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode ( #9730 )
...
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-11-06 01:45:45 +00:00
Wallas Henrique
966e31697b
[Bugfix] Fix pickle of input when async output processing is on ( #9931 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-11-06 00:39:26 +00:00
zifeitong
43300bd98a
[Bugfix] Properly propagate trust_remote_code settings ( #10047 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-05 16:34:40 -08:00
youkaichao
ca9844b340
[bugfix] fix weak ref in piecewise cudagraph and tractable test ( #10048 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 14:49:20 -08:00
Michael Goin
235366fe2e
[CI] Prune back the number of tests in tests/kernels/* ( #9932 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:32 -05:00
Michael Goin
02462465ea
[CI] Prune tests/models/decoder_only/language/* tests ( #9940 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:23 -05:00
Jee Jee Li
b9c64c0ca7
[Misc] Modify BNB parameter name ( #9997 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-05 14:40:08 -05:00
lkchen
d2e80332a7
[Feature] Update benchmark_throughput.py to support image input ( #9851 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-05 19:30:02 +00:00
Michael Goin
a53046b16f
[Model] Support quantization of PixtralHFTransformer for PixtralHF ( #9921 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 10:42:20 -08:00
Russell Bryant
731aec5be7
[CI/Build] Limit github CI jobs based on files changed ( #9928 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 10:30:42 -08:00
Chenghao (Alan) Yang
09d3550372
[Misc] Add logging for CUDA memory ( #10027 )
...
Signed-off-by: Chenghao Yang <yangalan1996@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Chenghao Yang <yangalan1996@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-05 09:50:50 -08:00
Richard Liu
cd34029e91
Refactor TPU requirements file and pin build dependencies ( #10010 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-05 16:48:44 +00:00
Russell Bryant
5952d81139
[Frontend] Fix tcp port reservation for api server ( #10012 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 07:50:57 -08:00
Chauncey
93dee88f6b
[Misc] vllm CLI flags should be ordered for better user readability ( #10017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-05 18:59:56 +08:00
Gene Der Su
7a83b1aec0
[BugFix] Lazy import ray ( #10021 )
2024-11-05 10:04:10 +00:00
Tyler Michael Smith
ad23318928
[Bugfix] Fixup Mamba ( #10004 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 03:46:38 +00:00
Cyrus Leung
bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable ( #9604 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-05 10:07:31 +08:00
Tyler Michael Smith
04bbf38e05
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep ( #9994 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 01:08:21 +00:00
Michael Goin
8f0a9ca890
[Bugfix] Respect modules_to_not_convert within awq_marlin ( #9895 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-04 16:57:44 -07:00
youkaichao
2094062b4e
[4.5/N] bugfix for quant config in speculative decode ( #10007 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 15:11:59 -08:00
bnellnm
d93478b399
[Bugfix] Upgrade to pytorch 2.5.1 ( #10001 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-04 15:11:28 -08:00
tomeras91
ac04a97a9f
[Frontend] Add max_tokens prometheus metric ( #9881 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-04 22:53:24 +00:00
lkchen
9a5664d4a4
[Misc] Refactor benchmark_throughput.py ( #9779 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <lkchen@github.com >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-04 14:32:16 -08:00
Robert Shaw
04cef2c6ab
[Bugfix] Fix MQLLMEngine hanging ( #9973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-11-04 16:01:43 -05:00
Roger Wang
6e056bcf04
[Doc] Update VLM doc about loading from local files ( #9999 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-04 19:47:11 +00:00
hissu-hyvarinen
5208dc7a20
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs ( #9279 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-11-04 11:37:46 -08:00
Robert Shaw
1c45f4c385
[CI] Basic Integration Test For TPU ( #9968 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
2024-11-04 11:34:26 -08:00
Mor Zusman
603a661ae8
[Model] factoring out MambaMixer out of Jamba ( #8993 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-04 18:00:00 +00:00
Jee Jee Li
fb2716d641
[Misc]Reduce BNB static variable ( #9987 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 17:04:40 +00:00
youkaichao
8d72bb20fa
[4/N] make quant config first-class citizen ( #9978 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 08:51:31 -08:00
Chauncey
ac6b8f19b9
[Frontend] Multi-Modality Support for Loading Local Image Files ( #9915 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-04 15:34:57 +00:00
Mengqing Cao
ccb5376a9a
[Bugfix][OpenVINO] Fix circular reference #9939 ( #9974 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-04 18:14:13 +08:00
Tran Quang Dai
ea4adeddc1
[Bugfix] Fix E2EL mean and median stats ( #9984 )
...
Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com >
2024-11-04 09:37:58 +00:00
Yang Zheng
4dbcbbeb09
[Misc] Compute query_start_loc/seq_start_loc on CPU ( #9447 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-11-04 08:54:37 +00:00
Gregory Shtrasberg
b67feb1274
[Bugfix]Using the correct type hints ( #9885 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-11-04 06:19:51 +00:00
Jee Jee Li
c49f0407ba
[Bugfix] Fix MiniCPMV and Mllama BNB bug ( #9917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 03:36:41 +00:00
Robert Shaw
91c9ebbb1b
[V1] Fix Configs ( #9971 )
2024-11-04 00:24:40 +00:00
shanshan wang
54597724f4
[Model] Add support for H2OVL-Mississippi models ( #9747 )
...
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-04 00:15:36 +00:00
Nick Hill
1f1b6d6eda
[V1] Support per-request seed ( #9945 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
2024-11-03 09:14:17 -08:00
youkaichao
3bb4befea7
[bugfix] fix tsts ( #9959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 15:54:05 -07:00
Yongzao
ae5279a163
[torch.compile] Adding torch compile to vision-language models ( #9946 )
2024-11-02 12:56:05 -07:00
Nikita Furin
1b73ab2a1f
[CI/Build] Quoting around > ( #9956 )
2024-11-02 12:50:28 -07:00
youkaichao
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 12:08:49 -07:00
youkaichao
74b529ceee
[bugfix] fix chatglm dummy_data_for_glmv ( #9955 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 08:03:33 -07:00
Robert Shaw
d6459b4516
[V1] Fix EngineArgs refactor on V1 ( #9954 )
2024-11-02 07:44:38 -07:00
youkaichao
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-02 07:35:05 -07:00
Michael Green
1d4cfe2be1
[Doc] Updated tpu-installation.rst with more details ( #9926 )
...
Signed-off-by: Michael Green <mikegre@google.com >
2024-11-02 10:06:45 -04:00
Nick Hill
eed92f12fc
[Docs] Update Granite 3.0 models in supported models table ( #9930 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-02 09:02:18 +00:00
youkaichao
af7380d83b
[torch.compile] fix cpu broken code ( #9947 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 23:35:47 -07:00
sroy745
a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models ( #9559 )
2024-11-01 23:22:49 -07:00
Kevin H. Luu
d522034c85
[ci/build] Have dependabot ignore pinned dependencies ( #9935 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-11-01 23:56:13 +00:00
Peter Salas
6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking ( #8346 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-01 16:21:10 -07:00
dependabot[bot]
d151fde834
[ci/build] Bump the patch-update group with 10 updates ( #9897 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-11-01 23:04:42 +00:00
Gene Der Su
27cd36e6e2
[Bugfix] PicklingError on RayTaskError ( #9934 )
...
Signed-off-by: Gene Su <e870252314@gmail.com >
2024-11-01 22:08:23 +00:00
youkaichao
18bd7587b7
[1/N] pass the complete config from engine to executor ( #9933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 13:51:57 -07:00
Pavani Majety
598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale ( #9861 )
2024-11-01 12:15:05 -07:00
youkaichao
aff1fd8188
[torch.compile] use interpreter with stable api from pytorch ( #9889 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 11:50:37 -07:00
André Jonasson
4581d2cc02
[Core] Refactor: Clean up unused argument in Scheduler._preempt ( #9696 )
...
Signed-off-by: André Jonasson <andre.jonasson@gmail.com >
2024-11-01 11:41:38 -07:00
Travis Johnson
1dd4cb2935
[Bugfix] Fix edge cases for MistralTokenizer ( #9625 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2024-11-01 10:33:15 -07:00
Cyrus Leung
ba0d892074
[Frontend] Use a proper chat template for VLM2Vec ( #9912 )
2024-11-01 14:09:07 +00:00
Michael Goin
30a2e80742
[CI/Build] Add Model Tests for PixtralHF ( #9813 )
2024-11-01 07:55:29 -06:00
Cyrus Leung
06386a64dd
[Frontend] Chat-based Embeddings API ( #9759 )
2024-11-01 08:13:35 +00:00
Cyrus Leung
d3aa2a8b2f
[Doc] Update multi-input support ( #9906 )
2024-11-01 07:34:49 +00:00
Yongzao
2b5bf20988
[torch.compile] Adding torch compile annotations to some models ( #9876 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-01 00:25:47 -07:00
Michael Goin
93a76dd21d
[Model] Support bitsandbytes for MiniCPMV ( #9891 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:31:56 +08:00
youkaichao
566cd27797
[torch.compile] rework test plans ( #9866 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 22:20:17 -07:00
Michael Goin
37a4947dcd
[Bugfix] Fix layer skip logic with bitsandbytes ( #9887 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:12:44 +08:00
youkaichao
96e0c9cbbd
[torch.compile] directly register custom op ( #9896 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 21:56:09 -07:00
Joe Runde
031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode ( #9892 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-01 01:09:46 +00:00
Kevin H. Luu
b63c64d95b
[ci/build] Configure dependabot to update pip dependencies ( #9811 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-31 15:55:38 -07:00
Mor Zusman
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 ( #9838 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-10-31 20:06:25 +00:00
sasha0552
55650c83a0
[Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together ( #9532 )
...
Signed-off-by: sasha0552 <admin@sasha0552.org >
2024-10-31 11:46:36 -07:00
Alexei-V-Ivanov-AMD
77f7ef2908
[CI/Build] Adding a forced docker system prune to clean up space ( #9849 )
2024-11-01 01:02:58 +08:00
Alex Brooks
16b8f7a86f
[CI/Build] Add Model Tests for Qwen2-VL ( #9846 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-31 09:10:52 -07:00
Jee Jee Li
5608e611c2
[Doc] Update Qwen documentation ( #9869 )
2024-10-31 08:54:18 +00:00
Roger Wang
3ea2dc2ec4
[Misc] Remove deprecated arg for cuda graph capture ( #9864 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-10-31 07:22:07 +00:00
Michael Goin
d087bf863e
[Model] Support quantization of Qwen2VisionTransformer ( #9817 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-30 22:41:20 -07:00
Kevin H. Luu
890ca36072
Revert "[Bugfix] Use host argument to bind to interface ( #9798 )" ( #9852 )
2024-10-31 01:44:51 +00:00
Guillaume Calmettes
abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint ( #9837 )
2024-10-30 18:15:56 -07:00
youkaichao
64384bbcdf
[torch.compile] upgrade tests ( #9858 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 16:34:22 -07:00
Yongzao
00d91c8a2c
[CI/Build] Simplify exception trace in api server tests ( #9787 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-30 14:52:05 -07:00
youkaichao
c2cd1a2142
[doc] update pp support ( #9853 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 13:36:51 -07:00
Harsha vardhan manoj Bikki
c787f2d81d
[Neuron] Update Dockerfile.neuron to fix build failure ( #9822 )
2024-10-30 12:22:02 -07:00
Joe Runde
33d257735f
[Doc] link bug for multistep guided decoding ( #9843 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 17:28:29 +00:00
Joe Runde
3b3f1e7436
[Bugfix][core] replace heartbeat with pid check ( #9818 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 09:34:07 -07:00
Elfie Guo
9ff4511e43
[Misc] Add chunked-prefill support on FlashInfer. ( #9781 )
2024-10-30 09:33:53 -07:00
Went-Liang
81f09cfd80
[Model] Support math-shepherd-mistral-7b-prm model ( #9697 )
...
Signed-off-by: Went-Liang <wenteng_liang@163.com >
2024-10-30 09:33:42 -07:00
Alex Brooks
cc98f1e079
[CI/Build] VLM Test Consolidation ( #9372 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-30 09:32:17 -07:00
Woosuk Kwon
211fe91aa8
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA ( #9438 )
2024-10-30 09:41:38 +00:00
Jee Jee Li
6aa6020f9b
[Misc] Specify minimum pynvml version ( #9827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 23:05:43 -07:00
youkaichao
ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph ( #9715 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 23:03:49 -07:00
Russell Bryant
7b0365efef
[Doc] Add the DCO to CONTRIBUTING.md ( #9803 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-30 05:22:23 +00:00
Yan Ma
04a3ae0aca
[Bugfix] Fix multi nodes TP+PP for XPU ( #8884 )
...
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn >
2024-10-29 21:34:45 -07:00
Kevin H. Luu
62fac4b9aa
[ci/build] Pin CI dependencies version with pip-compile ( #9810 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-30 03:34:55 +00:00
Michael Goin
226688bd61
[Bugfix][VLM] Make apply_fp8_linear work with >2D input ( #9812 )
2024-10-29 19:49:44 -07:00
Lily Liu
64cb1cdc3f
Update README.md ( #9819 )
2024-10-29 17:28:43 -07:00
youkaichao
1ab6f6b4ad
[core][distributed] fix custom allreduce in pytorch 2.5 ( #9815 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 17:06:24 -07:00
Michael Goin
bc73e9821c
[Bugfix] Fix prefix strings for quantized VLMs ( #9772 )
2024-10-29 16:02:59 -07:00
Simon Mo
8d7724104a
[Docs] Add notes about Snowflake Meetup ( #9814 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-10-29 15:19:02 -07:00
Will Eaton
882a1ad0de
[Model] tool calling support for ibm-granite/granite-20b-functioncalling ( #8339 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
2024-10-29 15:07:37 -07:00
Joe Runde
67bdf8e523
[Bugfix][Frontend] Guard against bad token ids ( #9634 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-29 14:13:20 -07:00
Kunjan
0ad216f575
[MISC] Set label value to timestamp over 0, to keep track of recent history ( #9777 )
...
Signed-off-by: Kunjan Patel <kunjanp@google.com >
2024-10-29 19:52:19 +00:00
Russell Bryant
7585ec996f
[CI/Build] mergify: fix rules for ci/build label ( #9804 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-29 19:24:42 +00:00
Michael Goin
ab6f981671
[CI][Bugfix] Skip chameleon for transformers 4.46.1 ( #9808 )
2024-10-29 11:12:43 -07:00
Junichi Sato
ac3d748dba
[Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel ( #9806 )
2024-10-29 10:40:35 -07:00
yannicks1
0ce7798f44
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) ( #9801 )
...
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2024-10-29 10:39:20 -07:00
Sven Seeberg
0f43387157
[Bugfix] Use host argument to bind to interface ( #9798 )
2024-10-29 10:37:59 -07:00
tastelikefeet
08600ddc68
Fix the log to correct guide user to install modelscope ( #9793 )
...
Signed-off-by: yuze.zyz <yuze.zyz@alibaba-inc.com >
2024-10-29 10:36:59 -07:00
科英
74fc2d77ae
[Misc] Add metrics for request queue time, forward time, and execute time ( #9659 )
2024-10-29 10:32:56 -07:00
wangshuai09
622b7ab955
[Hardware] using current_platform.seed_everything ( #9785 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-29 14:47:44 +00:00
Isotr0py
09500f7dde
[Model] Add BNB quantization support for Mllama ( #9720 )
2024-10-29 08:20:02 -04:00
Zhong Qishuai
ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation ( #9427 )
...
Signed-off-by: Qishuai Ferdinandzhong@gmail.com
2024-10-29 11:49:47 +00:00
Cyrus Leung
eae3d48181
[Bugfix] Use temporary directory in registry ( #9721 )
2024-10-28 22:08:20 -07:00
Cyrus Leung
e74f2d448c
[Doc] Specify async engine args in docs ( #9726 )
2024-10-28 22:07:57 -07:00
Jee Jee Li
7a4df5f200
[Model][LoRA]LoRA support added for Qwen ( #9622 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 04:14:07 +00:00
Russell Bryant
c5d7fb9ddc
[Doc] fix third-party model example ( #9771 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 19:39:21 -07:00
youkaichao
76ed5340f0
[torch.compile] add deepseek v2 compile ( #9775 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 14:35:17 -07:00
youkaichao
97b61bfae6
[misc] avoid circular import ( #9765 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 20:51:23 +00:00
Yongzao
aa0addb397
Adding "torch compile" annotations to moe models ( #9758 )
2024-10-28 13:49:56 -07:00
litianjian
5f8d8075f9
[Model][VLM] Add multi-video support for LLaVA-Onevision ( #8905 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-28 18:04:10 +00:00
Russell Bryant
8b0e4f2ad7
[CI/Build] Adopt Mergify for auto-labeling PRs ( #9259 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 09:38:09 -07:00
Yan Ma
2adb4409e0
[Bugfix] Fix ray instance detect issue ( #9439 )
2024-10-28 07:13:03 +00:00
Robert Shaw
feb92fbe4a
Fix beam search eos ( #9627 )
2024-10-28 06:59:37 +00:00
youkaichao
32176fee73
[torch.compile] support moe models ( #9632 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 21:58:04 -07:00
wangshuai09
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-28 04:07:00 +00:00
madt2709
34a9941620
[Bugfix] Fix load config when using bools ( #9533 )
2024-10-27 13:46:41 -04:00
Harry Mellor
e130c40e4e
Fix cache management in "Close inactive issues and PRs" actions workflow ( #9734 )
2024-10-27 10:30:03 -07:00
bnellnm
3cb07a36a2
[Misc] Upgrade to pytorch 2.5 ( #9588 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-27 09:44:24 +00:00
youkaichao
8549c82660
[core] cudagraph output with tensor weak reference ( #9724 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 00:19:28 -07:00
科英
67a6882da4
[Misc] SpecDecodeWorker supports profiling ( #9719 )
...
Signed-off-by: Abatom <abatom@163.com >
2024-10-27 04:18:03 +00:00
kakao-kevin-us
6650e6a930
[Model] Add classification Task with Qwen2ForSequenceClassification ( #9704 )
...
Signed-off-by: Kevin-Yang <ykcha9@gmail.com >
Co-authored-by: Kevin-Yang <ykcha9@gmail.com >
2024-10-26 17:53:35 +00:00
Vasiliy Alekseev
07e981fdf4
[Frontend] Bad words sampling parameter ( #9717 )
...
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru >
2024-10-26 16:29:38 +00:00
ErkinSagiroglu
55137e8ee3
Fix: MI100 Support By Bypassing Custom Paged Attention ( #9560 )
2024-10-26 12:12:57 +00:00
Mengqing Cao
5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino ( #9716 )
2024-10-26 10:59:06 +00:00
Sam Stoelinga
067e77f9a8
[Bugfix] Steaming continuous_usage_stats default to False ( #9709 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-10-26 05:05:47 +00:00
Travis Johnson
6567e13724
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding ( #9631 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: pavlo-ruban <pavlo.ruban@servicenow.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-25 15:42:56 -07:00
Rafael Vasquez
228cfbd03f
[Doc] Improve quickstart documentation ( #9256 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-25 14:32:10 -07:00
Michael Goin
ca0d92227e
[Bugfix] Fix compressed_tensors_moe bad config.strategy ( #9677 )
2024-10-25 12:40:33 -07:00
Woosuk Kwon
9645b9f646
[V1] Support sliding window attention ( #9679 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-24 22:20:37 -07:00
Will Johnson
a6f3721861
[Model] add a lora module for granite 3.0 MoE models ( #9673 )
2024-10-24 22:00:17 -07:00
Kevin H. Luu
9f7b4ba865
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 ( #9676 )
2024-10-24 20:59:00 -07:00
Michael Goin
c91ed47c43
[Bugfix] Remove xformers requirement for Pixtral ( #9597 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 15:38:05 -07:00
Charlie Fu
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement ( #9384 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2024-10-24 15:37:52 -07:00
Michael Goin
e26d37a185
[Log][Bugfix] Fix default value check for image_url.detail ( #9663 )
2024-10-24 10:44:38 -07:00
Alex Brooks
722d46edb9
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints ( #9650 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-24 10:42:24 -07:00
Cyrus Leung
c866e0079d
[CI/Build] Fix VLM test failures when using transformers v4.46 ( #9666 )
2024-10-25 01:40:40 +08:00
Yongzao
d27cfbf791
[torch.compile] Adding torch compile annotations to some models ( #9641 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 09:31:42 -07:00
Harry Mellor
de662d32b5
Increase operation per run limit for "Close inactive issues and PRs" workflow ( #9661 )
...
Signed-off-by: Harry Mellor <hej.mellor@gmail.com >
2024-10-24 12:17:45 -04:00
litianjian
f58454968f
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models ( #9653 )
2024-10-24 07:52:07 -07:00
Cyrus Leung
b979143d5b
[Doc] Move additional tips/notes to the top ( #9647 )
2024-10-24 09:43:59 +00:00
Yongzao
ad6f78053e
[torch.compile] expanding support and fix allgather compilation ( #9637 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 01:32:15 -07:00
Jee Jee Li
295a061fb3
[Kernel] add kernel for FATReLU ( #9610 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-24 16:18:27 +08:00
Yongzao
8a02cd045a
[torch.compile] Adding torch compile annotations to some models ( #9639 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 00:54:57 -07:00
youkaichao
4fdc581f9e
[core] simplify seq group code ( #9569 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-10-24 00:16:44 -07:00
Woosuk Kwon
3770071eb4
[V1][Bugfix] Clean up requests when aborted ( #9629 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-23 23:33:22 -07:00
Cyrus Leung
836e8ef6ee
[Bugfix] Fix PP for ChatGLM and Molmo ( #9422 )
2024-10-24 06:12:05 +00:00
Yan Ma
056a68c7db
[XPU] avoid triton import for xpu ( #9440 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-24 05:14:00 +00:00
Vinay R Damodaran
33bab41060
[Bugfix]: Make chat content text allow type content ( #9358 )
...
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2024-10-24 05:05:49 +00:00
Michael Goin
b7df53cd42
[Bugfix] Use "vision_model" prefix for MllamaVisionModel ( #9628 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:07:44 +08:00
Michael Goin
bb01f2915e
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image ( #9626 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:03:44 +08:00
Russell Bryant
b548d7a5f4
[CI/Build] Add bot to close stale issues and PRs ( #9436 )
2024-10-23 15:45:26 -07:00
Yunfei Chu
fc6c274626
[Model] Add Qwen2-Audio model support ( #9248 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-23 17:54:22 +00:00
Alex Brooks
150b779081
[Frontend] Enable Online Multi-image Support for MLlama ( #9393 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-23 17:28:57 +00:00
Yongzao
9013e24f7b
[torch.compile] Adding torch compile annotations to some models ( #9614 )
2024-10-23 10:07:48 -07:00
Michael Goin
fd0e2cfdb2
[Misc] Separate total and output tokens in benchmark_throughput.py ( #8914 )
2024-10-23 16:47:20 +00:00
Tyler Michael Smith
e5ac6a4199
[Bugfix] Fix divide by zero when serving Mamba models ( #9617 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-10-23 16:40:43 +00:00
youkaichao
dbdd3b5e5a
[misc] comment to avoid future confusion about baichuan ( #9620 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-23 09:14:44 -07:00
Cyrus Leung
e7116c017c
[Bugfix] Fix _init_vision_model in NVLM_D model ( #9611 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 14:09:04 +00:00
Alex Brooks
31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs ( #9612 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-23 14:05:18 +00:00
Cyrus Leung
c18e1a3418
[VLM] Enable overriding whether post layernorm is used in vision encoder + fix quant args ( #9217 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 11:27:37 +00:00
Isotr0py
3ff57ebfca
[Model] Initialize Florence-2 language backbone support ( #9555 )
2024-10-23 10:42:47 +00:00
Mengqing Cao
2394962d70
[Hardware][XPU] using current_platform.is_xpu ( #9605 )
2024-10-23 08:28:21 +00:00
Luka Govedič
51c24c9736
[Build] Fix FetchContent multiple build issue ( #9596 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-10-23 12:43:07 +08:00
Cyrus Leung
831540cf04
[Model] Support E5-V ( #9576 )
2024-10-23 11:35:29 +08:00
Flex Wang
29061ed9df
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages ( #9590 )
2024-10-23 11:17:28 +08:00
Chen Zhang
65050a40e6
[Bugfix] Generate exactly input_len tokens in benchmark_throughput ( #9592 )
2024-10-22 17:45:35 -07:00
Seth Kimmel
208cb34c81
[Doc]: Update tensorizer docs to include vllm[tensorizer] ( #7889 )
...
Co-authored-by: Kaunil Dhruv <dhruv.kaunil@gmail.com >
2024-10-22 15:43:25 -07:00
yulei
b17046e298
[BugFix] Fix metrics error for --num-scheduler-steps > 1 ( #8234 )
2024-10-22 15:43:03 -07:00
Lucas Wilkinson
d1e8240875
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing ( #9487 )
2024-10-22 15:41:13 -07:00
Jeremy Arnold
cb6fdaa0a0
[Misc] Make benchmarks use EngineArgs ( #9529 )
2024-10-22 15:40:38 -07:00
Aurick Qiao
23b899a8e6
[Bugfix] fix detokenizer shallow copy ( #5919 )
2024-10-22 15:38:12 -07:00
youkaichao
17c79f3c36
[torch.compile] auto infer dynamic_arg_dims from type annotation ( #9589 )
2024-10-22 13:43:37 -07:00
Ronen Schaffer
cd5601ac37
[BugFix] Prevent exporting duplicate OpenTelemetry spans ( #9017 )
2024-10-22 11:11:53 -07:00
Yuhong Guo
434984e665
[Frontend] Support custom request_id from request ( #9550 )
...
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com >
2024-10-22 18:07:30 +00:00
Yuan
32a1ee74a0
[Hardware][Intel CPU][DOC] Update docs for CPU backend ( #6212 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com >
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com >
2024-10-22 10:38:04 -07:00
gopalsarda
08075c3448
[Bugfix] Eagle: change config name for fc bias ( #9580 )
2024-10-22 16:14:22 +00:00
Isotr0py
bb392ea2d2
[Model][VLM] Initialize support for Mono-InternVL model ( #9528 )
2024-10-22 16:01:46 +00:00
xendo
9dbcce84a7
[Neuron] [Bugfix] Fix neuron startup ( #9374 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-22 12:51:41 +00:00
Jee Jee Li
a48e3ec052
[CI/Build][LoRA] Temporarily fix long context failure issue ( #9579 )
2024-10-22 11:32:51 +00:00
Woosuk Kwon
6c5af09b39
[V1] Implement vLLM V1 [1/N] ( #9289 )
2024-10-22 01:24:07 -07:00
wangshuai09
3ddbe25502
[Hardware][CPU] using current_platform.is_cpu ( #9536 )
2024-10-22 00:50:43 -07:00
chenqianfzh
0d02747f2e
support TP in qwen2 bnb ( #9574 )
2024-10-22 07:13:23 +00:00
Rafael Vasquez
f7db5f0fa9
[Doc] Use shell code-blocks and fix section headers ( #9508 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-22 06:43:24 +00:00
Kuntai Du
ca30c3c84b
[Core] Remove evictor_v1 ( #9572 )
2024-10-22 04:55:49 +00:00
Wallas Henrique
c0292211ce
[CI/Build] Replaced some models on tests for smaller ones ( #9570 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-22 04:52:14 +00:00
Falko1
74692421f7
[Bugfix]: phi.py get rope_theta from config file ( #9503 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-22 02:53:36 +00:00
ngrozae
29acd2c34c
[Bugfix][OpenVINO] fix_dockerfile_openvino ( #9552 )
2024-10-21 19:47:52 -07:00
Cyrus Leung
f085995a7b
[CI/Build] Remove unnecessary fork_new_process ( #9484 )
2024-10-21 19:47:29 -07:00
Travis Johnson
b729901139
[Bugfix]: serialize config by value for --trust-remote-code ( #6751 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-21 19:46:24 -07:00
youkaichao
76a5e13270
[core] move parallel sampling out from vllm core ( #9302 )
2024-10-22 00:31:44 +00:00
Joe Runde
ef7faad1b8
🐛 Fixup more test failures from memory profiling ( #9563 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-21 17:10:56 -07:00
Kuntai Du
575dcebe9a
[CI] Make format checker error message more user-friendly by using emoji ( #9564 )
...
This PR makes the format checker's error message more user-friendly by adding emojis.
2024-10-21 23:45:15 +00:00
Wallas Henrique
711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch ( #9023 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-21 14:49:41 -07:00
Nick Hill
15713e3b75
[BugFix] Update draft model TP size check to allow matching target TP size ( #9394 )
...
Co-authored-by: Baoyuan Qi <qibaoyuan@126.com >
2024-10-21 14:14:29 -07:00
youkaichao
d621c43df7
[doc] fix format ( #9562 )
2024-10-21 13:54:57 -07:00
Nick Hill
9d9186be97
[Frontend] Reduce frequency of client cancellation checking ( #7959 )
2024-10-21 13:28:10 -07:00
Michael Goin
5241aa1494
[Model][Bugfix] Fix batching with multi-image in PixtralHF ( #9518 )
2024-10-21 14:20:07 -04:00
Varad Ahirwadkar
ec6bd6c4c6
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint ( #9492 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-21 17:43:02 +00:00
yudian0504
8ca8954841
[Bugfix][Misc]: fix graph capture for decoder ( #9549 )
2024-10-21 17:33:30 +00:00
Dhia Eddine Rhaiem
f6b97293aa
[Model] FalconMamba Support ( #9325 )
2024-10-21 12:50:16 -04:00
Thomas Parnell
496e991da8
[Doc] Consistent naming of attention backends ( #9498 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-10-21 22:29:57 +08:00
Cyrus Leung
696b01af8f
[CI/Build] Split up decoder-only LM tests ( #9488 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-20 21:27:50 -07:00
Andy Dai
855e0e6f97
[Frontend][Misc] Goodput metric support ( #9338 )
2024-10-20 18:39:32 +00:00
Chen Zhang
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
Michael Goin
962d2c6349
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision ( #9520 )
2024-10-20 05:29:14 +00:00
Chen Zhang
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger ( #9530 )
2024-10-20 00:05:02 +00:00
Michael Goin
8e3e7f2713
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf ( #9514 )
2024-10-19 10:44:29 -04:00
Cyrus Leung
263d8ee150
[Bugfix] Fix missing task for speculative decoding ( #9524 )
2024-10-19 06:49:40 +00:00
Yue Zhang
c5eea3c8ba
[Frontend] Support simpler image input format ( #9478 )
2024-10-18 23:17:07 -07:00
Russell Bryant
85dc92fc98
[CI/Build] Configure matcher for actionlint workflow ( #9511 )
...
Signed-off-by: Russell Bryant <russell.bryant@gmail.com >
2024-10-19 06:04:18 +00:00
Russell Bryant
dfd951ed9b
[CI/Build] Add error matching for ruff output ( #9513 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-19 05:42:20 +00:00
Joe Runde
82c25151ec
[Doc] update gpu-memory-utilization flag docs ( #9507 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-19 11:26:36 +08:00
Nick Hill
1325872ec8
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily ( #9521 )
2024-10-18 20:21:01 -07:00
Joe Runde
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-18 21:25:19 -04:00
sasha0552
337ed76671
[Bugfix] Fix offline mode when using mistral_common ( #9457 )
2024-10-18 18:12:32 -07:00
Thomas Parnell
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores ( #9497 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-18 17:55:48 -07:00
Cody Yu
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py ( #9510 )
2024-10-18 14:30:55 -07:00
Kunjan
9bb10a7d27
[MISC] Add lora requests to metrics ( #9477 )
...
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
2024-10-18 20:50:18 +00:00
Michael Goin
3921a2f29e
[Model] Support Pixtral models in the HF Transformers format ( #9036 )
2024-10-18 13:29:56 -06:00
Russell Bryant
67a7e5ef38
[CI/Build] Add error matching config for mypy ( #9512 )
2024-10-18 12:17:53 -07:00
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
Russell Bryant
7dbe738d65
[Misc] benchmark: Add option to set max concurrency ( #9390 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-18 11:15:28 -07:00
Tyler Michael Smith
ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py ( #9505 )
2024-10-18 16:59:19 +00:00
Cyrus Leung
1bbbcc0b1d
[CI/Build] Fix lint errors in mistral tokenizer ( #9504 )
2024-10-19 00:09:35 +08:00
Nick Hill
25aeb7d4c9
[BugFix] Fix and simplify completion API usage streaming ( #9475 )
2024-10-18 14:10:26 +00:00
tomeras91
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser ( #9154 )
2024-10-18 10:27:48 +00:00
Nick Hill
1ffc8a7362
[BugFix] Typing fixes to RequestOutput.prompt and beam search ( #9473 )
2024-10-18 07:19:53 +00:00
Russell Bryant
944dd8edaf
[CI/Build] Use commit hash references for github actions ( #9430 )
2024-10-17 21:54:58 -07:00
Haoyu Wang
154a8ae880
[Qwen2.5] Support bnb quant for Qwen2.5 ( #9467 )
2024-10-18 04:40:14 +00:00
Joe Runde
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-17 22:47:27 -04:00
Dipika Sikka
48138a8415
[BugFix] Stop silent failures on compressed-tensors parsing ( #9381 )
2024-10-17 18:54:00 -07:00
Robert Shaw
343f8e0905
Support BERTModel (first encoder-only embedding model) ( #9056 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: laishzh <laishengzhang@gmail.com >
Co-authored-by: Max de Bayser <maxdebayser@gmail.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-10-17 23:21:01 +00:00
Shashwat Srijan
bb76538bbd
[Hardwware][Neuron] Simplify model load for transformers-neuronx library ( #9380 )
2024-10-17 15:39:39 -07:00
sasha0552
d615b5c9f8
[Bugfix] Print warnings related to mistral_common tokenizer only once ( #9468 )
2024-10-17 21:44:20 +00:00
Kai Wu
d65049daab
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script ( #9013 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-17 21:11:11 +00:00
bnellnm
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
Luka Govedič
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
Cyrus Leung
7871659abb
[Misc] Remove commit id file ( #9470 )
2024-10-17 10:34:37 -07:00
Daniele
a2c71c5405
[CI/Build] remove .github from .dockerignore, add dirty repo check ( #9375 )
2024-10-17 10:25:06 -07:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removing block manager v1. This is the first piece of the prefix-caching-centric design. To achieve that design, we need to simplify the code path so that only the v2 block manager is used (it has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
Li, Jiang
5eda21e773
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support ( #9344 )
2024-10-17 12:21:04 -04:00
Woosuk Kwon
8e1cddcd44
[TPU] Call torch._sync(param) during weight loading ( #9437 )
2024-10-17 09:00:11 -07:00
sasha0552
5e443b594f
[Bugfix] Allow prefill of assistant response when using mistral_common ( #9446 )
2024-10-17 15:06:37 +00:00
Lucas Wilkinson
9d30a056e7
[misc] CUDA Time Layerwise Profiler ( #8337 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-10-17 10:36:09 -04:00
Cyrus Leung
390be74649
[Misc] Print stack trace using logger.exception ( #9461 )
2024-10-17 13:55:48 +00:00
Lucas Wilkinson
e312e52b44
[Kernel] Add Exllama as a backend for compressed-tensors ( #9395 )
2024-10-17 09:48:26 -04:00
Yuan Tang
dbfa8d31d5
Add notes on the use of Slack ( #9442 )
2024-10-17 04:46:46 +00:00
rasmith
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels ( #9391 )
2024-10-17 01:34:06 +00:00
Tyler Michael Smith
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel ( #9425 )
2024-10-16 23:46:06 +00:00
Russell Bryant
776dbd74f1
[CI/Build] mypy: Resolve some errors from checking vllm/engine ( #9267 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-16 22:55:59 +00:00
Lily Liu
8345045833
[Performance][Spec Decode] Optimize ngram lookup performance ( #9333 )
2024-10-16 13:37:45 -06:00
Junhao Li
5b8a1fde84
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft ( #9396 )
2024-10-16 16:40:24 +00:00
Mor Zusman
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba ( #9189 )
2024-10-16 12:12:43 -04:00
Patrick von Platen
415f76a9cb
Support mistral interleaved attn ( #9414 )
2024-10-16 13:28:30 +00:00
Isotr0py
cf1d62a644
[Model] Support SDPA attention for Molmo vision backbone ( #9410 )
2024-10-16 11:52:01 +00:00
Roger Wang
59230ef32b
[Misc] Consolidate example usage of OpenAI client for multimodal models ( #9412 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-16 11:20:51 +00:00
Cyrus Leung
cee711fdbb
[Core] Rename input data types ( #8688 )
2024-10-16 10:49:37 +00:00
Cyrus Leung
1de76a0e55
[CI/Build] Test VLM embeddings ( #9406 )
2024-10-16 09:44:30 +00:00
Cyrus Leung
7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM ( #9303 )
2024-10-16 14:31:00 +08:00
Cyrus Leung
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
Reza Salehi
ed920135c8
[Bugfix] Molmo text-only input bug fix ( #9397 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-16 04:56:09 +00:00
Lucas Wilkinson
717a5f82cd
[Bugfix][CI/Build] Fix CUDA 11.8 Build ( #9386 )
2024-10-16 00:15:21 +00:00
Chang Su
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids ( #9034 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-15 15:40:43 -07:00
Michael Goin
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions ( #8909 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-15 15:40:25 -07:00
Grace Ho
5d264f4ab8
pass ignore_eos parameter to all benchmark_serving calls ( #9349 )
2024-10-15 13:30:44 -07:00
Nick Hill
e9d517f276
[BugFix] Fix chat API continuous usage stats ( #9357 )
2024-10-14 23:19:48 -07:00
hhzhang16
55e081fbad
[Bugfix] Update InternVL input mapper to support image embeds ( #9351 )
2024-10-14 21:29:19 -07:00
Michael Goin
8e836d982a
[Doc] Fix code formatting in spec_decode.rst ( #9348 )
2024-10-14 21:29:11 -07:00
Steve Grubb
44eaa5a5d9
[Frontend] Clarify model_type error messages ( #9345 )
2024-10-14 21:29:01 -07:00
Tyler Michael Smith
169b530607
[Bugfix] Clean up some cruft in mamba.py ( #9343 )
2024-10-15 00:24:25 +00:00
Xiang Xu
f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images ( #9095 )
2024-10-14 15:24:26 -07:00
Brendan Wong
4d31cd424b
[Frontend] merge beam search implementations ( #9296 )
2024-10-14 15:05:52 -07:00
Woosuk Kwon
473e7b3606
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel ( #9350 )
2024-10-14 15:02:06 -07:00
Simon Mo
fd47e57f4b
[Docs] Remove PDF build from Readtehdocs ( #9347 )
2024-10-14 11:57:47 -07:00
Daniele
203ab8f80f
[CI/Build] setuptools-scm fixes ( #8900 )
2024-10-14 11:34:47 -07:00
Kunshang Ji
4141608c6a
[Hardware][intel GPU] add async output process for xpu ( #8897 )
2024-10-14 12:23:33 -06:00
Reza Salehi
dfe43a2071
[Model] Molmo vLLM Integration ( #9016 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-14 07:56:24 -07:00
Tyler Michael Smith
16b24e7dcd
[Bugfix] Bandaid fix for speculative decoding tests ( #9327 )
2024-10-13 23:02:11 +00:00
Lily Liu
f519902c52
[CI] Fix merge conflict ( #9317 )
2024-10-13 06:41:23 +00:00
Jee Jee Li
250e26a63e
[Bugfix]Fix MiniCPM's LoRA bug ( #9286 )
2024-10-12 09:36:47 -07:00
Yunmeng
2b184ddd4f
[Misc][Installation] Improve source installation script and doc ( #9309 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-12 09:36:40 -07:00
Xiang Xu
00298e092c
[Bugfix] Fix bug of xformer prefill for encoder-decoder ( #9026 )
2024-10-12 15:00:43 +08:00
Lily Liu
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
Maximilien de Bayser
ec10cb8511
[BugFix] Fix tool call finish reason in streaming case ( #9209 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-10-11 18:24:26 -07:00
Prashant Gupta
d11b46f3a5
[bugfix] fix f-string for error ( #9295 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-11 17:03:48 -07:00
Allen Wang
c6cf9295e1
[Bugfix] Sets is_first_step_output for TPUModelRunner ( #9202 )
2024-10-11 13:28:10 -07:00
Lucas Wilkinson
de9fb4bef8
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected ( #9254 )
2024-10-11 15:57:39 -04:00
Wallas Henrique
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-11 11:18:50 -07:00
homeffjy
1a1823871d
[Doc] Remove outdated comment to avoid misunderstanding ( #9287 )
2024-10-11 18:02:03 +00:00
sixgod
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
Burkhard Ringlein
f710090d8e
[Kernel] adding fused moe kernel config for L40S TP4 ( #9245 )
2024-10-11 08:54:22 -07:00
Tyler Michael Smith
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
Sebastian Schoennenbeck
df3dcdf49d
[Bugfix] Fix priority in multiprocessing engine ( #9277 )
2024-10-11 15:35:35 +00:00
Jee Jee Li
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
Cyrus Leung
e808156f30
[Misc] Collect model support info in a single process per model ( #9233 )
2024-10-11 11:08:11 +00:00
youkaichao
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-10-10 21:30:44 -07:00
Andy Dai
94bf9ae4e9
[Misc] Fix sampling from sonnet for long context case ( #9235 )
2024-10-11 00:33:16 +00:00
omrishiv
f990bab2a4
[Doc][Neuron] add note to neuron documentation about resolving triton issue ( #9257 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-10-10 23:36:32 +00:00
youkaichao
e00c094f15
[torch.compile] generic decorators ( #9258 )
2024-10-10 15:54:23 -07:00
Kevin H. Luu
a78c6ba7c8
[ci/build] Add placeholder command for custom models test ( #9262 )
2024-10-10 15:45:09 -07:00
dependabot[bot]
fb870fd491
Bump actions/setup-python from 3 to 5 ( #9195 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:46 -07:00
dependabot[bot]
270953bafb
Bump actions/checkout from 3 to 4 ( #9196 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:35 -07:00
dependabot[bot]
9cc811c4ff
Bump actions/github-script from 6 to 7 ( #9197 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:24 -07:00
youkaichao
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
Simon Mo
78c0b4166c
Suggest codeowners for the core components ( #9210 )
2024-10-10 12:29:24 -07:00
jordanyono
21efb603f5
[CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument ( #9252 )
2024-10-10 18:18:18 +00:00
Rafael Vasquez
055f3270d4
[Doc] Improve debugging documentation ( #9204 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-10 10:48:51 -07:00
Lucas Wilkinson
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError ( #9218 )
2024-10-10 17:39:56 +00:00
Ilya Lavrenov
83ea5c72b9
[OpenVINO] Use torch 2.4.0 and newer optimum version ( #9121 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-10 11:18:58 -06:00
whyiug
04de9057ab
[Model] support input image embedding for minicpmv ( #9237 )
2024-10-10 15:00:47 +00:00
Isotr0py
07c11cf4d4
[Bugfix] Fix lm_head weights tying with lora for llama ( #9227 )
2024-10-10 21:11:56 +08:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
Lucas Wilkinson
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
Michael Goin
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
youkaichao
de895f1697
[misc] improve model support check in another process ( #9208 )
2024-10-09 21:58:27 -07:00
Russell Bryant
cf25b93bdd
[Core] Fix invalid args to _process_request ( #9201 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-10 12:10:09 +08:00
Michael Goin
d5fbb8706d
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 ( #9130 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 12:51:47 -06:00
Russell Bryant
cdca8994bd
[CI/Build] mypy: check vllm/entrypoints ( #9194 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-09 17:15:28 +00:00
Li, Jiang
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
Ewout ter Hoeven
7dea289066
Add Dependabot configuration for GitHub Actions updates ( #1217 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 08:16:26 -07:00
Cyrus Leung
cfaa6008e6
[Bugfix] Access get_vocab instead of vocab in tool parsers ( #9188 )
2024-10-09 08:59:57 -06:00
Ahmad Fahadh Ilyas
21906a6f50
[Bugfix] Fix lora loading for Compressed Tensors in #9120 ( #9179 )
2024-10-09 12:10:44 +00:00
Jiangtao Hu
dc4aea677a
[Doc] Fix VLM prompt placeholder sample bug ( #9170 )
2024-10-09 08:59:42 +00:00
youkaichao
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
Cyrus Leung
8bfaa4e31e
[Bugfix] fix composite weight loading and EAGLE weight loading ( #9160 )
2024-10-09 00:36:55 -07:00
AlpinDale
0b5b5d767e
[Frontend] Log the maximum supported concurrency ( #8831 )
2024-10-09 00:03:14 -07:00
Hui Liu
cdc72e3c80
[Model] Remap FP8 kv_scale in CommandR and DBRX ( #9174 )
2024-10-09 06:43:06 +00:00
Joe Rowell
7627172bf4
[Bugfix][Doc] Report neuron error in output ( #9159 )
2024-10-08 22:43:34 -07:00
Travis Johnson
480b7f40cf
[Misc] Improve validation errors around best_of and n ( #9167 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-10-09 04:54:48 +00:00
Yuan Tang
acce7630c1
Update link to KServe deployment guide ( #9173 )
2024-10-09 03:58:49 +00:00
Yuan Tang
ffc4b27ea8
Add classifiers in setup.py ( #9171 )
2024-10-08 19:30:48 -07:00
chenqianfzh
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
Michael Goin
9ba0bd6aa6
Add lm-eval directly to requirements-test.txt ( #9161 )
2024-10-08 18:22:31 -07:00
Russell Bryant
2a131965a8
mypy: check additional directories ( #9162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-08 22:08:22 +00:00
bnellnm
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
Rafael Vasquez
de24046fcd
[Doc] Improve contributing and installation documentation ( #9132 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-08 20:22:08 +00:00
Sayak Paul
1874c6a1b0
[Doc] Update vlm.rst to include an example on videos ( #9155 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-08 18:12:29 +00:00
Daniele
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
Peter Pan
cfba685bd4
[CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models ( #8758 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-10-08 09:37:34 -07:00
Alex Brooks
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:31:26 +00:00
Alex Brooks
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:12:56 +00:00
Brendan Wong
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
youkaichao
e1faa2a598
[misc] improve ux on readme ( #9147 )
2024-10-07 22:26:25 -07:00
Kunshang Ji
80b57f00d5
[Intel GPU] Fix xpu decode input ( #9145 )
2024-10-08 03:51:14 +00:00
youkaichao
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
Simon Mo
8eeb857084
Add Slack to README ( #9137 )
2024-10-07 17:06:21 -07:00
youkaichao
fa45513a51
[misc] fix comment and variable name ( #9139 )
2024-10-07 16:07:05 -07:00
Kuntai Du
c0d9a98d0c
[Doc] Include performance benchmark in README ( #9135 )
2024-10-07 15:04:06 -07:00
Russell Bryant
e0dbdb013d
[CI/Build] Add linting for github actions workflows ( #7876 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-07 21:18:10 +00:00
TimWang
93cf74a8a7
[Doc]: Add deploying_with_k8s guide ( #8451 )
2024-10-07 13:31:45 -07:00
Cyrus Leung
151ef4efd2
[Model] Support NVLM-D and fix QK Norm in InternViT ( #9045 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2024-10-07 11:55:12 +00:00
Isotr0py
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
Isotr0py
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
Cyrus Leung
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
youkaichao
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
sroy745
c8f26bb636
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None ( #9103 )
2024-10-07 03:52:42 +00:00
Isotr0py
487678d046
[Bugfix][Hardware][CPU] Fix CPU model input for decode ( #9044 )
2024-10-06 19:14:27 -07:00
Varun Sundar Rabindranath
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-06 12:48:11 -07:00
Yanyi Liu
fdf59d30ea
[Bugfix] fix tool_parser error handling when serve a model not support it ( #8709 )
2024-10-06 12:51:08 +00:00
Cyrus Leung
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-06 16:35:27 +08:00
Cyrus Leung
f22619fe96
[Misc] Remove user-facing error for removed VLM args ( #9104 )
2024-10-06 01:33:52 -07:00
Brendan Wong
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-05 23:39:03 -07:00
TJian
23fea8714a
[Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model ( #9101 )
2024-10-06 13:00:04 +08:00
youkaichao
f4dd830e09
[core] use forward context for flash infer ( #9097 )
2024-10-05 19:37:31 -07:00
Andy Dai
5df1834895
[Bugfix] Fix order of arguments matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
Chen Zhang
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
Xin Yang
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
hhzhang16
53b3a33027
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs ( #8979 )
2024-10-04 22:05:37 -07:00
Chen Zhang
dac914b0d6
[Bugfix] use blockmanagerv1 for encoder-decoder ( #9084 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-05 04:45:38 +00:00
Zhuohan Li
a95354a36e
[Doc] Update README.md with Ray summit slides ( #9088 )
2024-10-05 02:54:45 +00:00
youkaichao
663874e048
[torch.compile] improve allreduce registration ( #9061 )
2024-10-04 16:43:50 -07:00
Chongming Ni
cc90419e89
[Hardware][Neuron] Add on-device sampling support for Neuron ( #8746 )
...
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
2024-10-04 16:42:20 -07:00
Cody Yu
27302dd584
[Misc] Fix CI lint ( #9085 )
2024-10-04 16:07:54 -07:00
Andy Dai
0cc566ca8f
[Misc] Add random seed for prefix cache benchmark ( #9081 )
2024-10-04 21:58:57 +00:00
Andy Dai
05c531be47
[Misc] Improved prefix cache example ( #9077 )
2024-10-04 21:38:42 +00:00
Kuntai Du
fbb74420e7
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang ( #7412 )
2024-10-04 14:01:44 -07:00
ElizaWszola
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika Sikka <ds3822@columbia.edu >
2024-10-04 12:34:44 -06:00
Flávia Béo
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-10-04 18:31:40 +00:00
Roger Wang
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models ( #8717 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:38:25 -07:00
Varad Ahirwadkar
e5dc713c23
[Hardware][PowerPC] Make oneDNN dependency optional for Power ( #9039 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-04 17:24:42 +00:00
Simon Mo
36eecfbddb
Remove AMD Ray Summit Banner ( #9075 )
2024-10-04 10:17:16 -07:00
Prashant Gupta
9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral ( #9008 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-04 16:24:40 +00:00
Lucas Wilkinson
22482e495e
[Bugfix] Flash attention arches not getting set properly ( #9062 )
2024-10-04 09:43:15 -06:00
whyiug
3d826d2c52
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL ( #9071 )
2024-10-04 14:34:58 +00:00
Cyrus Leung
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
Murali Andoorveedu
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:56:58 +08:00
Michael Goin
303d44790a
[Misc] Enable multi-step output streaming by default ( #9047 )
2024-10-03 22:55:42 -04:00
Lucas Wilkinson
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
代君
3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model ( #8405 )
2024-10-04 10:36:39 +08:00
Domen Vreš
2838d6b38e
[Bugfix] Weight loading fix for OPT model ( #9042 )
...
Co-authored-by: dvres <dvres@fri.uni-lj.si >
2024-10-03 19:53:29 -04:00
sroy745
91add85ec4
Fix failing spec decode test ( #9054 )
2024-10-03 23:07:29 +00:00
youkaichao
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
xendo
63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config ( #8959 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-03 18:02:07 +00:00
sroy745
f5d72b2fc6
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. ( #8678 )
2024-10-03 09:44:21 -07:00
Guillaume Calmettes
83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser ( #9020 )
2024-10-03 16:44:52 +08:00
Divakar Verma
01843c89b8
[Misc] log when using default MoE config ( #8971 )
2024-10-03 04:31:07 +00:00
Travis Johnson
19a4dd0990
[Bugfix] example template should not add parallel_tool_prompt if tools is none ( #9007 )
2024-10-03 03:04:17 +00:00
Nick Hill
18c2e30c57
[Doc] Update Granite model docs ( #9025 )
2024-10-03 02:42:24 +00:00
Shawn Tan
19f0d25796
[Model] Adding Granite MoE. ( #8206 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-03 09:33:57 +08:00
Sergey Shlyapnikov
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
Varun Sundar Rabindranath
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-02 19:44:39 +00:00
Alex Brooks
7f60520deb
[Misc] Update Default Image Mapper Error Log ( #8977 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-02 11:44:38 +00:00
afeldman-nm
563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching ( #8804 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
2024-10-02 07:52:20 +00:00
Lily Liu
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
vlsav
22f5851b80
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows ( #8997 )
2024-10-01 11:07:06 -07:00
Cyrus Leung
4f341bd4bf
[Doc] Update list of supported models ( #8987 )
2024-10-02 00:35:39 +08:00
Sebastian Schoennenbeck
35bd215168
[Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API ( #8965 )
2024-10-01 09:58:06 +00:00
Alex Brooks
1fe0a4264a
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders ( #8991 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-01 09:52:44 +00:00
Isotr0py
bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference ( #8986 )
2024-10-01 17:51:41 +08:00
Divakar Verma
82f3937e59
[Misc] add process_weights_after_loading for DummyLoader ( #8969 )
2024-10-01 03:46:41 +00:00
youkaichao
7da2487591
[torch.compile] fix tensor alias ( #8982 )
2024-10-01 03:40:48 +00:00
Kevin H. Luu
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-01 03:33:12 +00:00
Joe Runde
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-01 09:34:25 +08:00
Lily Liu
bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. ( #8975 )
2024-10-01 00:51:40 +00:00
Kevin H. Luu
1425a1bcf9
[ci] Add CODEOWNERS for test directories ( #8795 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-01 00:47:08 +00:00
Jee Jee Li
1cabfcefb6
[Misc] Adjust max_position_embeddings for LoRA compatibility ( #8957 )
2024-09-30 12:57:39 +00:00
Sebastian Schoennenbeck
be76e5aabf
[Core] Make scheduling policy settable via EngineArgs ( #8956 )
2024-09-30 12:28:44 +00:00
Isotr0py
2ae25f79cf
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg ( #8946 )
2024-09-30 13:01:20 +08:00
Jee Jee Li
8e60afa15e
[Model][LoRA]LoRA support added for MiniCPMV2.6 ( #8943 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-30 04:31:55 +00:00
Roger Wang
b6d7392579
[Misc][CI/Build] Include cv2 via mistral_common[opencv] ( #8951 )
2024-09-30 04:28:26 +00:00
whyiug
e01ab595d8
[Model] support input embeddings for qwen2vl ( #8856 )
2024-09-30 03:16:10 +00:00
Mor Zusman
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
danieljannai21
6c9ba48fde
[Frontend] Added support for HF's new continue_final_message parameter ( #8942 )
2024-09-29 17:59:47 +00:00
juncheoll
1fb9c1b0bf
[Misc] Fix typo in BlockSpaceManagerV1 ( #8944 )
2024-09-29 15:05:54 +00:00
Nick Hill
31f46a0d35
[BugFix] Fix seeded random sampling with encoder-decoder models ( #8870 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-29 09:43:14 +00:00
Jee Jee Li
3d49776bbb
[Model][LoRA]LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
Zilin Zhu
bc2ef1f77c
[Model] Support Qwen2.5-Math-RM-72B ( #8896 )
2024-09-28 21:19:39 -07:00
Tyler Michael Smith
2e7fe7e79f
[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching ( #8930 )
2024-09-29 03:13:01 +00:00
Cyrus Leung
26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory ( #8925 )
2024-09-29 02:50:51 +00:00
ElizaWszola
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-28 18:19:40 -07:00
sroy745
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots ( #8824 )
2024-09-29 09:17:45 +08:00
Russell Bryant
d1537039ce
[Core] Improve choice of Python multiprocessing method ( #8823 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-29 09:17:07 +08:00
youkaichao
cc276443b5
[doc] organize installation doc and expose per-commit docker ( #8931 )
2024-09-28 17:48:41 -07:00
Chen Zhang
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 ( #8891 )
2024-09-28 18:51:22 +00:00
Edouard B.
090e945e36
[Frontend] Make beam search emulator temperature modifiable ( #8928 )
...
Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr >
2024-09-28 11:30:21 -07:00
Cyrus Leung
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-28 09:54:35 -07:00
Varun Sundar Rabindranath
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
tastelikefeet
39d3f8d94f
[Bugfix] Fix code for downloading models from modelscope ( #8443 )
2024-09-28 08:24:12 -07:00
Cyrus Leung
b0298aa8cc
[Misc] Remove vLLM patch of BaichuanTokenizer ( #8921 )
2024-09-28 08:11:25 +00:00
Tyler Titsworth
260024a374
[Bugfix][Intel] Fix XPU Dockerfile Build ( #7824 )
...
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-27 23:45:50 -07:00
youkaichao
d86f6b2afb
[misc] fix wheel name ( #8919 )
2024-09-27 22:10:44 -07:00
Sebastian Schoennenbeck
bd429f2b75
[Core] Priority-based scheduling in async engine ( #8850 )
2024-09-27 15:07:10 -07:00
youkaichao
18e60d7d13
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag ( #8911 )
2024-09-27 14:27:56 -07:00
Varun Sundar Rabindranath
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-09-27 13:32:07 -07:00
Lucas Wilkinson
c5d55356f9
[Bugfix] fix for deepseek w4a16 ( #8906 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-27 13:12:34 -06:00
Luka Govedič
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
youkaichao
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
Brittany
8df2dc3c88
[TPU] Update pallas.py to support trillium ( #8871 )
2024-09-27 01:16:55 -07:00
Isotr0py
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 ( #8892 )
2024-09-27 01:15:58 -07:00
Peter Pan
0e088750af
[MISC] Fix invalid escape sequence '\' ( #8830 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-09-27 01:13:25 -07:00
youkaichao
dc4e3df5c2
[misc] fix collect env ( #8894 )
2024-09-27 00:26:38 -07:00
Cyrus Leung
3b00b9c26c
[Core] rename PromptInputs and inputs ( #8876 )
2024-09-26 20:35:15 -07:00
Maximilien de Bayser
344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use ( #8343 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-09-26 17:01:42 -07:00
Cyrus Leung
1b49148e47
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility ( #8764 )
2024-09-26 16:54:09 -07:00
Nick Hill
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade ( #8829 )
2024-09-26 16:46:43 -07:00
Tyler Michael Smith
71d21c73ab
[Bugfix] Fixup advance_step.cu warning ( #8815 )
2024-09-26 16:23:45 -07:00
Chirag Jain
ee2da3e9ef
fix validation: Only set tool_choice auto if at least one tool is provided ( #8568 )
2024-09-26 16:23:17 -07:00
Tyler Michael Smith
e2f6f26e86
[Bugfix] Fix print_warning_once's line info ( #8867 )
2024-09-26 16:18:26 -07:00
Michael Goin
b28d2104de
[Misc] Change dummy profiling and BOS fallback warns to log once ( #8820 )
2024-09-26 16:18:14 -07:00
Pernekhan Utemuratov
93d364da34
[Bugfix] Include encoder prompts len to non-stream api usage response ( #8861 )
2024-09-26 15:47:00 -07:00
Kevin H. Luu
d9cfbc891e
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM ( #8872 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-26 15:02:16 -07:00
youkaichao
70de39f6b4
[misc][installation] build from source without compilation ( #8818 )
2024-09-26 13:19:04 -07:00
fyuan1316
68988d4e0d
[CI/Build] Fix missing ci dependencies ( #8834 )
2024-09-26 11:04:39 -07:00
Michael Goin
520db4dbc1
[Docs] Add README to the build docker image ( #8825 )
2024-09-26 11:02:52 -07:00
Tyler Michael Smith
f70bccac75
[Build/CI] Upgrade to gcc 10 in the base build Docker image ( #8814 )
2024-09-26 10:07:18 -07:00
Roger Wang
4bb98f2190
[Misc] Update config loading for Qwen2-VL and remove Granite ( #8837 )
2024-09-26 07:45:30 -07:00
Michael Goin
7193774b1f
[Misc] Support quantization of MllamaForCausalLM ( #8822 )
2024-09-25 14:46:22 -07:00
Roger Wang
e2c6e0a829
[Doc] Update doc for Transformers 4.45 ( #8817 )
2024-09-25 13:29:48 -07:00
Chen Zhang
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-25 13:29:32 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility ( #8760 )" ( #8810 )
2024-09-25 10:36:26 -07:00
Michael Goin
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors ( #8588 )
2024-09-25 09:43:36 -07:00
科英
64840dfae4
[Frontend] MQLLMEngine supports profiling. ( #8761 )
2024-09-25 09:37:41 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility ( #8760 )
2024-09-25 09:36:47 -07:00
DefTruth
0c4d2ad5e6
[VLM][Bugfix] internvl with num_scheduler_steps > 1 ( #8614 )
2024-09-25 09:35:53 -07:00
Jee Jee Li
c6f2485c82
[Misc] Add extra deps for openai server image ( #8792 )
2024-09-25 09:35:23 -07:00
bnellnm
300da09177
[Kernel] Fullgraph and opcheck tests ( #8479 )
2024-09-25 08:35:52 -06:00
Hongxia Yang
1c046447a6
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade ( #8777 )
2024-09-25 22:26:37 +08:00
Woo-Yeon Lee
8fae5ed7f6
[Misc] Fix minor typo in scheduler ( #8765 )
2024-09-25 00:53:03 -07:00
David Newman
3368c3ab36
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node ( #8767 )
...
Signed-off-by: darthhexx <darthhexx@gmail.com >
2024-09-25 00:52:26 -07:00
Adam Tilghman
1ac3de09cd
[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer ( #8672 )
2024-09-25 07:49:26 +00:00
sohamparikh
3e073e66f1
[Bugfix] load fc bias from config for eagle ( #8790 )
2024-09-24 23:16:30 -07:00
Isotr0py
c23953675f
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend ( #8770 )
2024-09-24 23:16:11 -07:00
zifeitong
e3dd0692fa
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv ( #8250 )
2024-09-25 05:53:43 +00:00
sroy745
fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 ( #8752 )
2024-09-24 21:26:36 -07:00
sasha0552
b4522474a3
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal ( #8776 )
2024-09-24 21:26:33 -07:00
sroy745
ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py ( #8780 )
2024-09-24 21:26:18 -07:00
Joe Runde
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks ( #8583 )
2024-09-24 20:37:38 -07:00
Archit Patke
6da1ab6b41
[Core] Adding Priority Scheduling ( #5958 )
2024-09-24 19:50:50 -07:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-09-24 17:29:56 -07:00
Jee Jee Li
13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 ( #8768 )
2024-09-24 17:08:55 -07:00
youkaichao
1e7d5c01f5
[misc] soft drop beam search ( #8763 )
2024-09-24 15:48:39 -07:00
Daniele
2467b642dd
[CI/Build] fix setuptools-scm usage ( #8771 )
2024-09-24 12:38:12 -07:00
Lucas Wilkinson
72fc97a0f1
[Bugfix] Fix torch dynamo fixes caused by replace_parameters ( #8748 )
2024-09-24 14:33:21 -04:00
Andy
2529d09b5a
[Frontend] Batch inference for llm.chat() API ( #8648 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-09-24 09:44:11 -07:00
ElizaWszola
a928ded995
[Kernel] Split Marlin MoE kernels into multiple files ( #8661 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-24 09:31:42 -07:00
Hanzhi Zhou
cc4325b66a
[Bugfix] Fix potentially unsafe custom allreduce synchronization ( #8558 )
2024-09-24 01:08:14 -07:00
Alex Brooks
8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg ( #8658 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-24 07:36:46 +00:00
Peter Salas
3f06bae907
[Core][Model] Support loading weights by ID within models ( #7931 )
2024-09-24 07:14:15 +00:00
Cody Yu
b8747e8a7c
[MISC] Skip dumping inputs when unpicklable ( #8744 )
2024-09-24 06:10:03 +00:00
Simon Mo
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" ( #8750 )
2024-09-24 05:45:20 +00:00
youkaichao
0250dd68c5
re-implement beam search on top of vllm core ( #8726 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-09-23 22:08:12 -07:00
sroy745
88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 ( #8728 )
2024-09-24 04:43:13 +00:00
Hongxia Yang
530821d00c
[Hardware][AMD] ROCm6.2 upgrade ( #8674 )
2024-09-23 18:52:39 -07:00
Alexander Matveev
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse ( #8335 )
2024-09-23 15:38:04 -07:00
jiqing-feng
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids ( #8562 )
2024-09-23 12:32:27 -07:00
Russell Bryant
b05f5c9238
[Core] Allow IPv6 in VLLM_HOST_IP with zmq ( #8575 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-23 12:15:41 -07:00
Jee Jee Li
9b0e3ec970
[Kernel][LoRA] Add assertion for punica sgmv kernels ( #7585 )
2024-09-23 18:57:42 +00:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-23 13:46:26 -04:00
Daniele
ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ ( #4738 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-23 09:44:26 -07:00
Jani Monoses
f2bd246c17
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size ( #8707 )
2024-09-23 14:43:09 +00:00
Yanyi Liu
a79e522984
[Model] Support pp for qwen2-vl ( #8696 )
2024-09-23 13:46:59 +00:00
Li, Jiang
3e83c12b5c
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner ( #8733 )
2024-09-23 13:15:16 +00:00
Isotr0py
e551ca1555
[Hardware][CPU] Refactor CPU model runner ( #8729 )
2024-09-23 20:12:20 +08:00
Alex Brooks
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs ( #8657 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-23 07:44:48 +00:00
Yan Ma
d23679eb99
[Bugfix] fix docker build for xpu ( #8652 )
2024-09-22 22:54:18 -07:00
Luka Govedič
57a0702e63
[Bugfix] Fix CPU CMake build ( #8723 )
...
Co-authored-by: Yuan <yuan.zhou@intel.com >
2024-09-22 20:40:46 -07:00
Tyler Michael Smith
3dda7c2250
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building ( #8702 )
2024-09-22 22:24:59 -04:00
youkaichao
92ba7e7477
[misc] upgrade mistral-common ( #8715 )
2024-09-22 15:41:59 -07:00
youkaichao
d4a2ac8302
[build] enable existing pytorch (for GH200, aarch64, nightly) ( #8713 )
2024-09-22 12:47:54 -07:00
Lily Liu
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. ( #8701 )
2024-09-22 12:34:14 -07:00
litianjian
5b59532760
[Model][VLM] Add LLaVA-Onevision model support ( #8486 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-22 10:51:44 -07:00
Huazhong Ji
ca2b628b3c
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler ( #8703 )
2024-09-22 10:44:09 -07:00
Alex Brooks
8ca5051b9a
[Misc] Use NamedTuple in Multi-image example ( #8705 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-22 20:56:20 +08:00
Cyrus Leung
06ed2815e2
[Model] Refactor BLIP/BLIP-2 to support composite model loading ( #8407 )
2024-09-22 12:24:21 +00:00
youkaichao
0e40ac9b7b
[ci][build] fix vllm-flash-attn ( #8699 )
2024-09-21 23:24:58 -07:00
Isotr0py
13d88d4137
[Bugfix] Refactor composite weight loading logic ( #8656 )
2024-09-22 04:33:27 +00:00
Tyler Michael Smith
d66ac62854
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu ( #8643 )
2024-09-21 23:45:02 +00:00
Divakar Verma
9dc7c6c7f3
[dbrx] refactor dbrx experts to extend FusedMoe class ( #8518 )
2024-09-21 15:09:39 -06:00
rasmith
ec4aaad812
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 ( #8646 )
2024-09-21 09:20:54 +00:00
Andy Dai
4dfdf43196
[Doc] Fix typo in AMD installation guide ( #8689 )
2024-09-21 00:24:12 -07:00
Cyrus Leung
5e85f4f82a
[VLM] Use SequenceData.from_token_counts to create dummy data ( #8687 )
2024-09-20 23:28:56 -07:00
Luka Govedič
71c60491f2
[Kernel] Build flash-attn from source ( #8245 )
2024-09-20 23:27:10 -07:00
youkaichao
0faab90eb0
[beam search] add output for manually checking the correctness ( #8684 )
2024-09-20 19:55:33 -07:00
Cyrus Leung
0455c46ed4
[Core] Factor out common code in SequenceData and Sequence ( #8675 )
2024-09-21 02:30:39 +00:00
Kunshang Ji
d4bf085ad0
[MISC] add support custom_op check ( #8557 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-20 19:03:55 -07:00
Cyrus Leung
0057894ef7
[Core] Rename PromptInputs and inputs( #8673 )
2024-09-20 19:00:54 -07:00
zyddnys
0f961b3ce9
[Bugfix] Fix incorrect llava next feature size calculation ( #8496 )
2024-09-20 22:48:32 +00:00
omrishiv
7f9c8902e3
[Hardware][AWS] update neuron to 2.20 ( #8676 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:19:44 -07:00
omrishiv
7c8566aa4f
[Doc] neuron documentation update ( #8671 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:04:37 -07:00
Patrick von Platen
b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer ( #8640 )
2024-09-20 14:33:03 -07:00
Pastel!
2874bac618
[Bugfix] Config got an unexpected keyword argument 'engine' ( #8556 )
2024-09-20 14:00:45 -07:00
Cyrus Leung
035fa895ec
[Misc] Show AMD GPU topology in collect_env.py ( #8649 )
2024-09-20 13:52:19 -07:00
saumya-saran
b28298f2f4
[Bugfix] Validate SamplingParam n is an int ( #8548 )
2024-09-20 12:46:02 -07:00
Alexey Kondratiev(AMD)
2940afa04e
[CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build ( #8670 )
2024-09-20 10:27:44 -07:00
Niklas Muennighoff
3b63de9353
[Model] Add OLMoE ( #7922 )
2024-09-20 09:31:41 -07:00
Jiaxin Shan
260d40b5ea
[Core] Support Lora lineage and base model metadata management ( #6315 )
2024-09-20 06:20:56 +00:00
William Lin
9e5ec35b1f
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata ( #8474 )
2024-09-19 20:49:54 -07:00
Amit Garg
18ae428a0d
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference ( #8571 )
2024-09-20 08:54:02 +08:00
bnellnm
de6f90a13d
[Misc] guard against change in cuda library name ( #8609 )
2024-09-20 06:36:30 +08:00
Alexey Kondratiev(AMD)
6cb748e190
[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail ( #8551 )
2024-09-19 13:06:32 -07:00
Simon Mo
9e99407e3c
Create SECURITY.md ( #8642 )
2024-09-19 12:16:28 -07:00
Isotr0py
ea4647b7d7
[Doc] Add documentation for GGUF quantization ( #8618 )
2024-09-19 13:15:55 -06:00
盏一
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p ( #8619 )
2024-09-19 18:28:25 +00:00
Charlie Fu
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention ( #8577 )
2024-09-19 17:37:57 +00:00
Nick Hill
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too ( #8584 )
2024-09-19 12:51:06 -04:00
Kunshang Ji
855c8ae2c9
[MISC] remove engine_use_ray in benchmark_throughput.py ( #8615 )
2024-09-18 22:33:20 -07:00
Kuntai Du
c52ec5f034
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py ( #8616 )
2024-09-19 05:24:24 +00:00
Roger Wang
02c9afa2d0
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" ( #8593 )
2024-09-19 04:14:28 +00:00
sroy745
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. ( #8545 )
2024-09-19 02:24:15 +00:00
Tyler Michael Smith
4c34ce8916
[Kernel] Remove marlin moe templating on thread_m_blocks ( #8573 )
...
Co-authored-by: lwilkinson@neuralmagic.com
2024-09-19 01:42:49 +00:00
Joe Runde
0d47bf3bf4
[Bugfix] add dead_error property to engine client ( #8574 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-18 22:10:01 +00:00
Nick Hill
d9cd78eb71
[BugFix] Nonzero exit code if MQLLMEngine startup fails ( #8572 )
2024-09-18 20:17:55 +00:00
Tyler Michael Smith
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching ( #8039 )
2024-09-18 20:05:06 +00:00
Gregory Shtrasberg
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call ( #8380 )
...
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 10:41:08 -07:00
Geun, Lim
e18749ff09
[Model] Support Solar Model ( #8386 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 11:04:00 -06:00
Russell Bryant
d65798f78c
[Core] zmq: bind only to 127.0.0.1 for local-only usage ( #8543 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-18 16:10:27 +00:00
afeldman-nm
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step ( #8199 )
2024-09-18 08:38:43 -07:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
Jiaxin Shan
e351572900
[Misc] Add argument to disable FastAPI docs ( #8554 )
2024-09-18 09:51:59 +00:00
Daniele
95965d31b6
[CI/Build] fix Dockerfile.cpu on podman ( #8540 )
2024-09-18 10:49:53 +08:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching ( #8012 )
2024-09-17 23:44:27 +00:00
Alexey Kondratiev(AMD)
09deb4721f
[CI/Build] Excluding kernels/test_gguf.py from ROCm ( #8520 )
2024-09-17 16:40:29 -07:00
youkaichao
fa0c114fad
[doc] improve installation doc ( #8550 )
...
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com >
2024-09-17 16:24:06 -07:00
Joe Runde
98f9713399
[Bugfix] Fix TP > 1 for new granite ( #8544 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 23:17:08 +00:00
Nick Hill
56c3de018c
[Misc] Don't dump contents of kvcache tensors on errors ( #8527 )
2024-09-17 12:24:29 -07:00
Patrick von Platen
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format ( #8515 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-17 17:50:37 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization ( #8434 )
2024-09-17 08:09:12 -07:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
Isotr0py
1b6de8352b
[Benchmark] Support sample from HF datasets and image input for benchmark_serving ( #8495 )
2024-09-17 07:34:27 +00:00
Rui Qiao
cbdb252259
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change ( #8509 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-09-17 00:06:26 -07:00
youkaichao
99aa4eddaf
[torch.compile] register allreduce operations as custom ops ( #8526 )
2024-09-16 22:57:57 -07:00
Roger Wang
ee2bceaaa6
[Misc][Bugfix] Disable guided decoding for mistral tokenizer ( #8521 )
2024-09-16 22:22:45 -07:00
Alex Brooks
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing ( #8525 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-17 04:17:32 +00:00
Simon Mo
546034b466
[refactor] remove triton based sampler ( #8524 )
2024-09-16 20:04:48 -07:00
Joe Runde
cca61642e0
[Bugfix] Fix 3.12 builds on main ( #8510 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 00:01:45 +00:00
Simon Mo
5ce45eb54d
[misc] small qol fixes for release process ( #8517 )
2024-09-16 15:11:27 -07:00
Simon Mo
5478c4b41f
[perf bench] set timeout to debug hanging ( #8516 )
2024-09-16 14:30:02 -07:00
Kevin Lin
47f5e03b5b
[Bugfix] Bind api server port before starting engine ( #8491 )
2024-09-16 13:56:28 -07:00
youkaichao
2759a43a26
[doc] update doc on testing and debugging ( #8514 )
2024-09-16 12:10:23 -07:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
sasha0552
781e3b9a42
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel ( #8506 )
2024-09-16 12:15:57 -06:00
Nick Hill
acd5511b6d
[BugFix] Fix clean shutdown issues ( #8492 )
2024-09-16 09:33:46 -07:00
lewtun
837c1968f9
[Frontend] Expose revision arg in OpenAI server ( #8501 )
2024-09-16 15:55:26 +00:00
ElizaWszola
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE ( #8032 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
Chris
3724d5f6b5
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations ( #8490 )
2024-09-15 04:20:05 +00:00
Woosuk Kwon
50e9ec41fc
[TPU] Implement multi-step scheduling ( #8489 )
2024-09-14 16:58:31 -07:00
youkaichao
47790f3e32
[torch.compile] add a flag to disable custom op ( #8488 )
2024-09-14 13:07:16 -07:00
youkaichao
a36e070dad
[torch.compile] fix functionalization ( #8480 )
2024-09-14 09:46:04 -07:00
ywfang
8a0cf1ddc3
[Model] support minicpm3 ( #8297 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-14 14:50:26 +00:00
Charlie Fu
1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm ( #8310 )
2024-09-13 17:01:11 -07:00
Kunshang Ji
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 ( #8365 )
...
Co-authored-by: Yan Ma <yan.ma@intel.com >
2024-09-13 16:54:34 -07:00
Simon Mo
9ba0817ff1
bump version to v0.6.1.post2 ( #8473 )
2024-09-13 11:35:00 -07:00
Nick Hill
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming ( #8468 )
2024-09-13 11:31:12 -07:00
Isotr0py
f57092c00b
[Doc] Add oneDNN installation to CPU backend documentation ( #8467 )
2024-09-13 18:06:30 +00:00
Cyrus Leung
a84e598e21
[CI/Build] Reorganize models tests ( #7820 )
2024-09-13 10:20:06 -07:00
youkaichao
0a4806f0a9
[plugin][torch.compile] allow to add custom compile backend ( #8445 )
2024-09-13 09:32:42 -07:00
Cyrus Leung
ecd7a1d5b6
[Installation] Gate FastAPI version for Python 3.8 ( #8456 )
2024-09-13 09:02:26 -07:00
youkaichao
a2469127db
[misc][ci] fix quant test ( #8449 )
2024-09-13 17:20:14 +08:00
Jee Jee Li
06311e2956
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 ( #8442 )
2024-09-13 07:58:28 +00:00
youkaichao
cab69a15e4
[doc] recommend pip instead of conda ( #8446 )
2024-09-12 23:52:41 -07:00
Isotr0py
9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node ( #8437 )
2024-09-13 06:35:20 +00:00
Simon Mo
acda0b35d0
bump version to v0.6.1.post1 ( #8440 )
2024-09-12 21:39:49 -07:00
William Lin
ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor ( #8354 )
2024-09-12 21:30:00 -07:00
Alexander Matveev
6821020109
[Bugfix] Fix async log stats ( #8417 )
2024-09-12 20:48:59 -07:00
Cyrus Leung
8427550488
[CI/Build] Update pixtral tests to use JSON ( #8436 )
2024-09-13 03:47:52 +00:00
Cyrus Leung
3f79bc3d1a
[Bugfix] Bump fastapi and pydantic version ( #8435 )
2024-09-13 03:21:42 +00:00
shangmingc
40c396533d
[Bugfix] Mapping physical device indices for e2e test utils ( #8290 )
2024-09-13 11:06:28 +08:00
Cyrus Leung
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class ( #7329 )
2024-09-13 02:56:13 +00:00
Dipika Sikka
8f44a92d85
[BugFix] fix group_topk ( #8430 )
2024-09-13 09:23:42 +08:00
Roger Wang
360ddbd37e
[Misc] Update Pixtral example ( #8431 )
2024-09-12 17:31:18 -07:00
Wenxiang
a480939e8e
[Bugfix] Fix weight loading issue by rename variable. ( #8293 )
2024-09-12 19:25:00 -04:00
Patrick von Platen
d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs ( #8415 )
2024-09-12 15:21:51 -07:00
Roger Wang
b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 ( #8428 )
2024-09-12 15:05:35 -07:00
Roger Wang
c16369455f
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models ( #8425 )
2024-09-12 14:06:51 -07:00
Alexander Matveev
019877253b
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible ( #8427 )
2024-09-12 21:01:50 +00:00
Nick Hill
551ce01078
[Core] Add engine option to return only deltas or final output ( #7381 )
2024-09-12 12:02:00 -07:00
William Lin
a6c0f3658d
[multi-step] add flashinfer backend ( #7928 )
2024-09-12 11:16:22 -07:00
Joe Runde
f2e263b801
[Bugfix] Offline mode fix ( #8376 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-12 11:11:57 -07:00
Luis Vega
1f0c75afa9
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser ( #8423 )
2024-09-12 11:10:11 -07:00
WANGWEI
8a23e93302
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance ( #8403 )
2024-09-12 10:47:42 -07:00
Alex Brooks
c6202daeed
[Model] Support multiple images for qwen-vl ( #8247 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:54 -07:00
Isotr0py
e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches ( #8375 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:35 -07:00
Roger Wang
520ca380ae
[Hotfix][VLM] Fixing max position embeddings for Pixtral ( #8399 )
2024-09-12 09:28:37 -07:00
youkaichao
7de49aa86c
[torch.compile] hide slicing under custom op for inductor ( #8384 )
2024-09-12 00:11:55 -07:00
Woosuk Kwon
42ffba11ad
[Misc] Use RoPE cache for MRoPE ( #8396 )
2024-09-11 23:13:14 -07:00
Kevin Lin
295c4730a8
[Misc] Raise error when using encoder/decoder model with cpu backend ( #8355 )
2024-09-12 05:45:24 +00:00
Blueyo0
1bf2dd9df0
[Gemma2] add bitsandbytes support for Gemma2 ( #8338 )
2024-09-11 21:53:12 -07:00
tomeras91
5a60699c45
[Bugfix]: Fix the logic for deciding if tool parsing is used ( #8366 )
2024-09-12 03:55:30 +00:00
Michael Goin
b6c75e1cf2
Fix the AMD weight loading tests ( #8390 )
2024-09-11 20:35:33 -07:00
Woosuk Kwon
b71c956deb
[TPU] Use Ray for default distributed backend ( #8389 )
2024-09-11 20:31:51 -07:00
youkaichao
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
Cody Yu
a65cb16067
[MISC] Dump model runner inputs when crashing ( #8305 )
2024-09-12 01:12:25 +00:00
Simon Mo
3fd2b0d21c
Bump version to v0.6.1 ( #8379 )
2024-09-11 14:42:11 -07:00
Patrick von Platen
d394787e52
Pixtral ( #8377 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-11 14:41:55 -07:00
Lily Liu
775f00f81e
[Speculative Decoding] Test refactor ( #8317 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-11 14:07:34 -07:00
Aarni Koskela
8baa454937
[Misc] Move device options to a single place ( #8322 )
2024-09-11 13:25:58 -07:00
bnellnm
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2024-09-11 12:52:19 -07:00
Cyrus Leung
7015417fd4
[Bugfix] Add missing attributes in mistral tokenizer ( #8364 )
2024-09-11 11:36:54 -07:00
Alexey Kondratiev(AMD)
aea02f30de
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation ( #8373 )
2024-09-11 18:31:41 +00:00
Li, Jiang
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend ( #7257 )
2024-09-11 09:46:46 -07:00
Yang Fan
3b7fea770f
[Model][VLM] Add Qwen2-VL model support ( #7905 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-11 09:31:19 -07:00
Pooya Davoodi
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch ( #8347 )
2024-09-11 05:30:11 +00:00
Yangshen⚡Deng
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-10 22:21:36 -07:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
Isotr0py
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
Jee Jee Li
e497b8aeff
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models ( #8329 )
2024-09-10 20:59:19 -04:00
Tyler Michael Smith
94144e726c
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag ( #8043 )
2024-09-10 23:51:58 +00:00
William Lin
1d5e397aa4
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers ( #8172 )
2024-09-10 23:46:08 +00:00
Alexander Matveev
22f3a4bc6c
[Bugfix] lookahead block table with cuda graph max capture ( #8340 )
...
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340 )
2024-09-10 16:00:35 -07:00
Cody Yu
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled ( #8342 )
2024-09-10 22:28:28 +00:00
Prashant Gupta
04e7c4e771
[Misc] remove peft as dependency for prompt models ( #8162 )
2024-09-10 17:21:56 -04:00
Kevin Lin
5faedf1b62
[Spec Decode] Move ops.advance_step to flash attn advance_step ( #8224 )
2024-09-10 13:18:14 -07:00
sumitd2
02751a7a42
Fix ppc64le buildkite job ( #8309 )
2024-09-10 12:58:34 -07:00
Alexey Kondratiev(AMD)
f421f3cefb
[CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail ( #8130 )
2024-09-10 11:51:15 -07:00
Cyrus Leung
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
Daniele
6234385f4a
[CI/Build] enable ccache/sccache for HIP builds ( #8327 )
2024-09-10 08:55:08 -07:00
Cyrus Leung
da1a844e61
[Bugfix] Fix missing post_layernorm in CLIP ( #8155 )
2024-09-10 08:22:50 +00:00
Simon Mo
a1d874224d
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info ( #8319 )
2024-09-09 23:21:00 -07:00
Dipika Sikka
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
Kyle Sayers
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
Vladislav Kruglikov
f9b4a2d415
[Bugfix] Correct adapter usage for cohere and jamba ( #8292 )
2024-09-09 11:20:46 -07:00
Adam Lugowski
58fcc8545a
[Frontend] Add progress reporting to run_batch.py ( #8060 )
...
Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io >
2024-09-09 11:16:37 -07:00
Kyle Mistele
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
Alexander Matveev
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption ( #8267 )
2024-09-07 21:01:51 -07:00
Joe Runde
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-07 13:03:16 -07:00
sumitd2
b962ee1470
ppc64le: Dockerfile fixed, and a script for buildkite ( #8026 )
2024-09-07 11:18:40 -07:00
Isotr0py
36bf8150cc
[Model][VLM] Decouple weight loading logic for Paligemma ( #8269 )
2024-09-07 17:45:44 +00:00
Isotr0py
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
youkaichao
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
Wei-Sheng Chin
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) ( #8241 )
2024-09-06 20:18:16 -07:00
Cyrus Leung
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
Kyle Mistele
41e95c5247
[Bugfix] Fix Hermes tool call chat template bug ( #8256 )
...
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-07 10:49:01 +08:00
William Lin
12dd715807
[misc] [doc] [frontend] LLM torch profiler support ( #7943 )
2024-09-06 17:48:48 -07:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-06 17:02:05 -06:00
Dipika Sikka
23f322297f
[Misc] Remove SqueezeLLM ( #8220 )
2024-09-06 16:29:03 -06:00
rasmith
9db52eab3d
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput ( #8248 )
2024-09-06 16:26:09 -06:00
Alexey Kondratiev(AMD)
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
Rui Qiao
de80783b69
[Misc] Use ray[adag] dependency instead of cuda ( #7938 )
2024-09-06 09:18:35 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py ( #8191 )
2024-09-06 09:01:14 -07:00
Nick Hill
baa5467547
[BugFix] Fix Granite model configuration ( #8216 )
2024-09-06 11:39:29 +08:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-09-05 18:10:33 -07:00
sroy745
2febcf2777
[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM ( #7962 )
2024-09-05 16:25:29 -04:00
Michael Goin
2ee45281a5
Move verify_marlin_supported to GPTQMarlinLinearMethod ( #8165 )
2024-09-05 11:09:46 -04:00
Alex Brooks
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com
8685ba1a1e
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
Cyrus Leung
288a938872
[Doc] Indicate more information about supported modalities ( #8181 )
2024-09-05 10:51:53 +00:00
Elfie Guo
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
Kevin H. Luu
ba262c4e5a
[ci] Mark LoRA test as soft-fail ( #8160 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-04 20:33:12 -07:00
Woosuk Kwon
4624d98dbd
[Misc] Clean up RoPE forward_native ( #8076 )
2024-09-04 20:31:48 -07:00
William Lin
1afc931987
[bugfix] >1.43 constraint for openai ( #8169 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 17:35:36 -07:00
Maureen McElaney
e01c2beb7d
[Doc] [Misc] Create CODE_OF_CONDUCT.md ( #8161 )
2024-09-04 16:50:13 -07:00
Simon Mo
32e7db2536
Bump version to v0.6.0 ( #8166 )
2024-09-04 16:34:27 -07:00
Harsha vardhan manoj Bikki
008cf886c9
[Neuron] Adding support for adding/ overriding neuron configuration a… ( #8062 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-09-04 16:33:43 -07:00
Cody Yu
77d9e514a2
[MISC] Replace input token throughput with total token throughput ( #8164 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 20:23:22 +00:00
Kyle Mistele
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models ( #5649 )
...
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com >
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-04 13:18:13 -07:00
Woosuk Kwon
561d6f8077
[CI] Change test input in Gemma LoRA test ( #8163 )
2024-09-04 13:05:50 -07:00
alexeykondrat
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm ( #7369 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-04 11:57:54 -07:00
Cody Yu
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests ( #8131 )
2024-09-04 18:53:25 +00:00
wnma
d3311562fb
[Bugfix] remove post_layernorm in siglip ( #8106 )
2024-09-04 18:55:37 +08:00
TimWang
ccd7207191
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env ( #8103 )
2024-09-03 23:17:05 -07:00
Cyrus Leung
855c262a6b
[Frontend] Multimodal support in offline chat ( #8098 )
2024-09-04 05:22:17 +00:00
Peter Salas
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks ( #7963 )
2024-09-04 04:38:21 +00:00
Dipika Sikka
e16fa99a6a
[Misc] Update fbgemmfp8 to use vLLMParameters ( #7972 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-03 20:12:41 -06:00
Woosuk Kwon
61f4a93d14
[TPU][Bugfix] Use XLA rank for persistent cache path ( #8137 )
2024-09-03 18:35:33 -07:00
Nick Hill
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py ( #7964 )
2024-09-03 20:57:41 -04:00
Dipika Sikka
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters ( #7976 )
2024-09-03 17:21:44 -04:00
Simon Mo
dc0b6066ab
[CI] Change PR reminder to avoid at-mentions ( #8134 )
2024-09-03 14:11:42 -07:00
Woosuk Kwon
0af3abe3d3
[TPU][Bugfix] Fix next_token_ids shape ( #8128 )
2024-09-03 13:29:24 -07:00
Kevin H. Luu
f1575dc99f
[ci] Fix GHA workflow ( #8129 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 13:25:09 -07:00
tomeras91
c02638efb3
[CI/Build] make pip install vllm work in macos (for import only) ( #8118 )
2024-09-03 12:37:08 -07:00
Antoni Baum
652c83b697
[Misc] Raise a more informative exception in add/remove_logger ( #7750 )
2024-09-03 12:28:25 -07:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
Kevin H. Luu
95a178f861
[CI] Only PR reviewers/committers can trigger CI on PR ( #8124 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 11:32:27 -07:00
Cody Yu
bd852f2a8b
[Performance] Enable chunked prefill and prefix caching together ( #8120 )
...
Co-authored-by: Tao He <sighingnow@gmail.com >
Co-authored-by: Juelianqvq <Juelianqvq@noreply.github.com >
2024-09-03 10:49:18 -07:00
Isotr0py
ec266536b7
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend ( #8061 )
2024-09-03 21:37:52 +08:00
Woosuk Kwon
0fbc6696c2
[Bugfix] Fix single output condition in output processor ( #7881 )
2024-09-02 20:35:42 -07:00
wang.yuqi
6e36f4fa6c
improve chunked prefill performance
...
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874 )
2024-09-02 14:20:12 -07:00
Isotr0py
dd2a6a82e3
[Bugfix] Fix internlm2 tensor parallel inference ( #8055 )
2024-09-02 23:48:56 +08:00
Isotr0py
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension ( #8056 )
2024-09-02 08:43:26 -04:00
Woosuk Kwon
e2b2aa5a0f
[TPU] Align worker index with node boundary ( #7932 )
2024-09-01 23:09:46 -07:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
Shawn Tan
f8d60145b4
[Model] Add Granite model ( #7436 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-09-01 18:37:18 -07:00
Roger Wang
5b86b19954
[Misc] Optional installation of audio related packages ( #8063 )
2024-09-01 14:46:57 -07:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items ( #8049 )
2024-08-31 16:35:53 -07:00
Robert Shaw
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation ( #8059 )
2024-08-31 19:44:03 +00:00
Nicolò Lucchesi
4f5d8446ed
[Bugfix] Fix ModelScope models in v0.5.5 ( #8037 )
2024-08-31 00:27:58 -07:00
Cyrus Leung
d05f0a9db2
[Bugfix] Fix import error in Phi-3.5-MoE ( #8052 )
2024-08-30 22:26:55 -07:00
Pavani Majety
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
Wenxiang
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE ( #7729 )
...
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Zeqi Lin <zelin@microsoft.com >
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com >
2024-08-30 13:42:57 -06:00
Woosuk Kwon
2684efc467
[TPU][Bugfix] Fix tpu type api ( #8035 )
2024-08-30 09:01:26 -07:00
Kaunil Dhruv
058344f89a
[Frontend]-config-cli-args ( #7737 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com >
2024-08-30 08:21:02 -07:00
Cyrus Leung
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models ( #8028 )
2024-08-30 08:20:34 -07:00
Jungho Christopher Cho
f97be32d1d
[VLM][Model] TP support for ViTs ( #7186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-30 08:19:27 -07:00
Cyrus Leung
afd39a4511
[Bugfix] Fix import error in Exaone model ( #8034 )
2024-08-30 08:03:28 -07:00
Richard Liu
2148441fd3
[TPU] Support single and multi-host TPUs on GKE ( #7613 )
2024-08-30 00:27:40 -07:00
Yohan Na
dc13e99348
[MODEL] add Exaone model support ( #7819 )
2024-08-29 23:34:20 -07:00
Avshalom Manevich
34a0e96d46
[Kernel] changing fused moe kernel chunk size default to 32k ( #7995 )
2024-08-30 04:11:39 +00:00
Woosuk Kwon
80c7b089b1
[TPU] Async output processing for TPU ( #8011 )
2024-08-29 19:35:29 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
Wei-Sheng Chin
0c785d344d
Add more percentiles and latencies ( #7759 )
2024-08-29 16:48:11 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models ( #7445 )
2024-08-29 19:09:08 -04:00
Harsha vardhan manoj Bikki
257afc37c5
[Neuron] Adding support for context-length, token-gen buckets. ( #7885 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-08-29 13:58:14 -07:00
Dipika Sikka
86a677de42
[misc] update tpu int8 to use new vLLM Parameters ( #7973 )
2024-08-29 16:46:55 -04:00
Isotr0py
d78789ac16
[Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism ( #7954 )
2024-08-29 15:54:49 -04:00
kushanam
c334b1898b
extend cuda graph size for H200 ( #7894 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-29 12:15:04 -07:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-29 14:53:11 -04:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
Jonas M. Kübler
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding ( #7899 )
2024-08-28 22:18:13 -07:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
Peter Salas
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors ( #7974 )
2024-08-29 03:24:31 +00:00
youkaichao
a7f65c2be9
[torch.compile] remove reset ( #7975 )
2024-08-28 17:32:26 -07:00
Nick Hill
4289cad37f
[Frontend] Minor optimizations to zmq decoupled front-end ( #7957 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-08-28 17:22:43 -07:00
Michael Goin
af59df0a10
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test ( #7961 )
2024-08-28 19:19:17 -04:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-28 16:10:12 -07:00
bnellnm
3cdfe1f38b
[Bugfix] Make torch registration of punica ops optional ( #7970 )
2024-08-28 16:11:49 -06:00
Mor Zusman
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM ( #7651 )
2024-08-28 15:06:52 -07:00
Stas Bekman
8c56e57def
[Doc] fix 404 link ( #7966 )
2024-08-28 13:54:23 -07:00
Woosuk Kwon
eeffde1ac0
[TPU] Upgrade PyTorch XLA nightly ( #7967 )
2024-08-28 13:10:21 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00
Pavani Majety
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-28 10:01:22 -07:00
Cyrus Leung
ef9baee3c5
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 ( #7948 )
2024-08-28 08:11:18 -07:00
Stas Bekman
98c12cffe5
[Doc] fix the autoAWQ example ( #7937 )
2024-08-28 12:12:32 +00:00
youkaichao
f52a43a8b9
[ci][test] fix pp test failure ( #7945 )
2024-08-28 01:27:07 -07:00
Cody Yu
e3580537a4
[Performance] Enable chunked prefill and prefix caching together ( #7753 )
2024-08-28 00:36:31 -07:00
Alexander Matveev
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) ( #7911 )
2024-08-28 00:02:30 -07:00
Cyrus Leung
51f86bf487
[mypy][CI/Build] Fix mypy errors ( #7929 )
2024-08-27 23:47:44 -07:00
bnellnm
c166e7e43e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. ( #7886 )
2024-08-27 23:13:45 -04:00
youkaichao
bc6e42a9b1
[hardware][rocm] allow rocm to override default env var ( #7926 )
2024-08-27 19:50:06 -07:00
Peter Salas
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt ( #7902 )
2024-08-28 01:53:56 +00:00
Jonathan Berkhahn
9c71c97ae2
[mypy] Enable mypy type checking for vllm/core ( #7229 )
2024-08-28 07:11:14 +08:00
zifeitong
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference ( #7230 )
2024-08-28 07:09:02 +08:00
Philipp Schmid
345be0e244
[benchmark] Update TGI version ( #7917 )
2024-08-27 15:07:53 -07:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7766 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-27 15:07:09 -07:00
youkaichao
ed6f002d33
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES ( #7924 )
2024-08-27 12:06:11 -07:00
Isotr0py
b09c755be8
[Bugfix] Fix phi3v incorrect image_idx when using async engine ( #7916 )
2024-08-27 17:36:09 +00:00
alexeykondrat
42e932c7d4
[CI/Build][ROCm] Enabling tensorizer tests for ROCm ( #7237 )
2024-08-27 10:09:13 -07:00
Kunshang Ji
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. ( #7810 )
2024-08-27 10:07:02 -07:00
Isotr0py
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test ( #7897 )
2024-08-27 15:28:30 +00:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
Cody Yu
9606c7197d
Revert #7509 ( #7887 )
2024-08-27 00:16:31 -07:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling ( #7796 )
2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client ( #7565 )
2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com >
2024-08-26 20:53:20 -07:00
Dipika Sikka
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights ( #7825 )
2024-08-26 18:09:34 -06:00
omrishiv
760e9f71a8
[Bugfix] neuron: enable tensor parallelism ( #7562 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-08-26 15:13:13 -07:00
youkaichao
05826c887b
[misc] fix custom allreduce p2p cache file generation ( #7853 )
2024-08-26 15:02:25 -07:00
Dipika Sikka
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters ( #7762 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-26 17:44:54 -04:00
Dipika Sikka
665304092d
[Misc] Update qqq to use vLLMParameters ( #7805 )
2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule ( #7822 )
2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer ( #7836 )
2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding ( #7855 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-08-26 03:16:38 +00:00
Nick Hill
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation ( #7851 )
2024-08-25 15:45:14 -07:00
youkaichao
70c094ade6
[misc][cuda] improve pynvml warning ( #7852 )
2024-08-25 14:30:09 -07:00
Isotr0py
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test ( #7835 )
2024-08-25 15:53:09 +00:00
Isotr0py
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models ( #7783 )
2024-08-25 11:51:20 +00:00
zifeitong
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes ( #7840 )
2024-08-24 18:16:24 -07:00
youkaichao
aab0fcdb63
[ci][test] fix RemoteOpenAIServer ( #7838 )
2024-08-24 17:31:28 +00:00
youkaichao
ea9fa160e3
[ci][test] exclude model download time in server start time ( #7834 )
2024-08-24 01:03:27 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines ( #7831 )
2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol ( #7654 )
2024-08-23 23:07:24 -07:00
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API ( #7641 )
2024-08-23 23:04:22 -07:00
Pooya Davoodi
6885fde317
[Bugfix] Fix run_batch logger ( #7640 )
2024-08-23 13:58:26 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00
Simon Mo
09c7792610
Bump version to v0.5.5 ( #7823 )
2024-08-23 11:35:33 -07:00
Dipika Sikka
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters ( #7803 )
2024-08-23 14:30:52 -04:00
youkaichao
35ee2ad6b9
[github][misc] promote asking llm first ( #7809 )
2024-08-23 09:38:50 -07:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt ( #7746 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-08-23 13:12:44 +00:00
Jie Fu (傅杰)
faeddb565d
[misc] Add Torch profiler support for CPU-only devices ( #7806 )
2024-08-23 05:46:25 +00:00
Kunshang Ji
fc5ebbd1d3
[Hardware][Intel GPU] refactor xpu_model_runner for tp ( #7712 )
2024-08-22 20:06:54 -07:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. ( #7584 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-22 17:44:25 -07:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup ( #7786 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 21:50:21 +00:00
Siyuan Liu
a152246428
[Misc] fix typo in triton import warning ( #7794 )
2024-08-22 13:51:23 -07:00
Kevin H. Luu
666ad0aa16
[ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args ( #7705 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-22 20:10:55 +00:00
Michael Goin
15310b5101
[Bugfix] Use LoadFormat values for vllm serve --load-format ( #7784 )
2024-08-22 11:37:08 -07:00
Peter Salas
57792ed469
[Doc] Fix incorrect docs from #7615 ( #7788 )
2024-08-22 10:02:06 -07:00
Jiaxin Shan
d3b5b98021
[Misc] Enhance prefix-caching benchmark tool ( #6568 )
2024-08-22 09:32:02 -07:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output ( #7232 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-22 09:33:48 -04:00
Dipika Sikka
955b5191c9
[Misc] update fp8 to use vLLMParameter ( #7437 )
2024-08-22 08:36:18 -04:00
Lucas Wilkinson
55d63b1211
[Bugfix] Don't build machete on cuda <12.0 ( #7757 )
2024-08-22 08:28:52 -04:00
Flex Wang
4f419c00a6
Fix ShardedStateLoader for vllm fp8 quantization ( #7708 )
2024-08-22 08:25:04 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer ( #6830 )
2024-08-22 02:42:24 -07:00
Woosuk Kwon
b3856bef7d
[Misc] Use torch.compile for GemmaRMSNorm ( #7642 )
2024-08-22 01:14:13 -07:00
youkaichao
8c6f694a79
[ci] refine dependency for distributed tests ( #7776 )
2024-08-22 00:54:15 -07:00
Woosuk Kwon
eeee1c3b1a
[TPU] Avoid initializing TPU runtime in is_tpu ( #7763 )
2024-08-21 21:31:49 -07:00
Michael Goin
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )" ( #7764 )
2024-08-22 03:42:14 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness ( #7443 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 02:18:11 +00:00
zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue ( #7710 )
2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-21 20:18:00 -04:00
Gregory Shtrasberg
9984605412
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility ( #7477 )
...
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com >
2024-08-21 16:47:36 -07:00
youkaichao
7eebe8ccaa
[distributed][misc] error on same VLLM_HOST_IP setting ( #7756 )
2024-08-21 16:25:34 -07:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-21 16:17:10 -07:00
William Lin
5844017285
[ci] [multi-step] narrow multi-step test dependency paths ( #7760 )
2024-08-21 15:52:40 -07:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig ( #7615 )
2024-08-21 22:49:39 +00:00
William Lin
dd53c4b023
[misc] Add Torch profiler support ( #7451 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-21 15:39:26 -07:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX ( #7716 )
2024-08-21 19:53:01 +00:00
William Lin
91f4522cbf
[multi-step] Raise error if not using async engine ( #7703 )
2024-08-21 11:49:19 -07:00
sasha0552
1b32e02648
[Bugfix] Pass PYTHONPATH from setup.py to CMake ( #7730 )
2024-08-21 11:17:48 -07:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend ( #7394 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-21 13:34:14 -04:00
Brian Li
d3c002eadc
[Bugfix] chat method add_generation_prompt param ( #7734 )
2024-08-21 17:33:35 +00:00
Nick Hill
9b73a2f498
[Spec Decoding] Use target model max length as default for draft model ( #7706 )
2024-08-22 00:23:22 +08:00
Isotr0py
6925cdbeea
[Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend ( #7735 )
2024-08-21 16:23:03 +00:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
sasha0552
dd3fa0e430
[Bugfix] Mirror jinja2 in pyproject.toml ( #7723 )
2024-08-21 13:41:17 +00:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Fei <dfdfcai4@gmail.com >
2024-08-20 23:28:21 -07:00
Roger Wang
4506641212
[Doc] Section for Multimodal Language Models ( #7719 )
2024-08-20 23:24:01 -07:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
youkaichao
b74a125800
[ci] try to log process using the port to debug the port usage ( #7711 )
2024-08-20 17:41:12 -07:00
Antoni Baum
66a9e713a7
[Core] Pipe worker_class_fn argument in Executor ( #7707 )
2024-08-21 00:37:39 +00:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test ( #7709 )
2024-08-20 17:12:44 -07:00
Kunshang Ji
6e4658c7aa
[Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) ( #7685 )
2024-08-20 12:01:09 -07:00
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
Lucas Wilkinson
c6af027a35
[Misc] Add jinja2 as an explicit build requirement ( #7695 )
2024-08-20 17:17:47 +00:00
Ronen Schaffer
2aa00d59ad
[CI/Build] Pin OpenTelemetry versions and make errors clearer ( #7266 )
...
[CI/Build] Pin OpenTelemetry versions and make availability errors clearer (#7266 )
2024-08-20 10:02:21 -07:00
Kunshang Ji
c42590f97a
[Hardware] [Intel GPU] refactor xpu worker/executor ( #7686 )
2024-08-20 09:54:10 -07:00
Isotr0py
aae6927be0
[VLM][Model] Add test for InternViT vision encoder ( #7409 )
2024-08-20 23:10:20 +08:00
Ilya Lavrenov
398521ad19
[OpenVINO] Updated documentation ( #7687 )
2024-08-20 07:33:56 -06:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
Kunshang Ji
b6f99a6ffe
[Core] Refactor executor classes for easier inheritance ( #7673 )
...
[Core] Refactor executor classes to make it easier to inherit GPUExecutor (#7673 )
2024-08-20 00:56:50 -07:00
youkaichao
ad28a74beb
[misc][cuda] add warning for pynvml user ( #7675 )
2024-08-20 00:35:09 -07:00
jianyizh
e6d811dd13
[XPU] fallback to native implementation for xpu custom op ( #7670 )
2024-08-20 00:26:09 -07:00
youkaichao
c4be16e1a7
[misc] add nvidia related library in collect env ( #7674 )
2024-08-19 23:22:49 -07:00
Kuntai Du
3d8a5f063d
[CI] Organizing performance benchmark files ( #7616 )
2024-08-19 22:43:54 -07:00
Zijian Hu
f4fc7337bf
[Bugfix] support tie_word_embeddings for all models ( #5724 )
2024-08-19 20:00:04 -07:00
Kevin H. Luu
0df7ec0b2d
[ci] Install Buildkite test suite analysis ( #7667 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-19 19:55:04 -07:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion ( #7508 )
2024-08-19 17:58:14 -07:00
youkaichao
e54ebc2f8f
[doc] fix doc build error caused by msgspec ( #7659 )
2024-08-19 17:50:59 -07:00
Travis Johnson
67e02fa8a4
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding ( #7665 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-20 00:43:09 +00:00
Woosuk Kwon
43735bf5e1
[TPU] Remove redundant input tensor cloning ( #7660 )
2024-08-19 15:55:04 -07:00
Andrew Song
da115230fd
[Bugfix] Don't disable existing loggers ( #7664 )
2024-08-19 15:11:58 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization ( #7520 )
2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling ( #7000 )
...
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
2024-08-19 13:52:13 -07:00
Ali Panahi
dad961ef5c
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 ( #5428 )
2024-08-19 20:47:00 +00:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00
Woosuk Kwon
df845b2b46
[Misc] Remove Gemma RoPE ( #7638 )
2024-08-19 09:29:31 -07:00
Kunshang Ji
1a36287b89
[Bugfix] Fix xpu build ( #7644 )
2024-08-18 22:00:09 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available ( #7137 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-19 03:24:03 +00:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
Woosuk Kwon
200a2ffa6b
[Misc] Refactor Llama3 RoPE initialization ( #7637 )
2024-08-18 17:18:12 -07:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models ( #7475 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-08-19 07:43:46 +08:00
Robert Shaw
e3b318216d
[Bugfix] Fix Prometheus Metrics With zeromq Frontend ( #7279 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-18 20:19:48 +00:00
Woosuk Kwon
ab7165f2c7
[TPU] Optimize RoPE forward_native2 ( #7636 )
2024-08-18 01:15:10 -07:00
Woosuk Kwon
0c2fa50b84
[TPU] Use mark_dynamic only for dummy run ( #7634 )
2024-08-18 00:18:53 -07:00
Woosuk Kwon
ce143353c6
[TPU] Skip creating empty tensor ( #7630 )
2024-08-17 14:22:46 -07:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
Jee Jee Li
1ef13cf92f
[Misc]Fix BitAndBytes exception messages ( #7626 )
2024-08-17 12:02:14 -07:00
youkaichao
832163b875
[ci][test] allow longer wait time for api server ( #7629 )
2024-08-17 11:26:38 -07:00
Besher Alkurdi
e73f76eec6
[Model] Pipeline parallel support for JAIS ( #7603 )
2024-08-17 11:11:09 -07:00
youkaichao
d95cc0a55c
[core][misc] update libcudart finding ( #7620 )
...
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-08-16 23:01:35 -07:00
youkaichao
5bf45db7df
[ci][test] fix engine/logger test ( #7621 )
2024-08-16 23:00:59 -07:00
youkaichao
eed020f673
[misc] use nvml to get consistent device name ( #7582 )
2024-08-16 21:15:13 -07:00
Xander Johnson
7c0b7ea214
[Bugfix] add >= 1.0 constraint for openai dependency ( #7612 )
2024-08-16 20:56:01 -07:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests ( #7600 )
2024-08-16 20:49:30 -07:00
Rui Qiao
bae888cb8e
[Bugfix] Clear engine reference in AsyncEngineRPCServer ( #7618 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-16 20:44:05 -07:00
Alexei-V-Ivanov-AMD
6bd19551b0
[Build/CI] Enabling passing AMD tests. ( #7610 )
2024-08-16 20:25:32 -07:00
bnellnm
e680349994
[Bugfix] Fix custom_ar support check ( #7617 )
2024-08-16 19:05:49 -07:00
Michael Goin
44f26a9466
[Model] Align nemotron config with final HF state and fix lm-eval-small ( #7611 )
2024-08-16 15:56:34 -07:00
bnellnm
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo ( #7596 )
2024-08-16 14:00:11 -07:00
bnellnm
7759ae958f
[Kernel][Misc] dynamo support for ScalarType ( #7594 )
2024-08-16 13:59:49 -07:00
bnellnm
9f69856356
[Kernel] register punica functions as torch ops ( #7591 )
2024-08-16 13:59:38 -07:00
Michael Goin
d4f0f17b02
[Doc] Update quantization supported hardware table ( #7595 )
2024-08-16 13:59:27 -07:00
Michael Goin
b3f4e17935
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints ( #7444 )
2024-08-16 13:59:16 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 ( #7440 )
...
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440 )
2024-08-16 13:46:01 -07:00
William Lin
f366f6339b
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend ( #7571 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-16 11:41:56 -07:00
Michael Goin
855866caa9
[Kernel] Add tuned triton configs for ExpertsInt8 ( #7601 )
2024-08-16 11:37:01 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE ( #7415 )
2024-08-16 10:06:51 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
fzyzcjy
ec724a725e
support tqdm in notebooks ( #7510 )
2024-08-16 09:17:50 -07:00
Gordon Wong
0e39a33c6d
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method ( #7513 )
2024-08-16 10:05:18 -06:00
Kuntai Du
6fc5b0f249
[CI] Fix crashes of performance benchmark ( #7500 )
2024-08-16 08:08:45 -07:00
Nick Hill
9587b050fb
[Core] Use uvloop with zmq-decoupled front-end ( #7570 )
2024-08-15 22:48:07 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops ( #7536 )
2024-08-15 22:38:56 -07:00
jon-chuang
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close ( #7324 )
2024-08-16 04:24:04 +00:00
Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm ( #5049 )
...
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-08-15 19:41:34 -07:00
youkaichao
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail ( #7572 )
2024-08-15 19:39:04 -07:00
Grant Pinkert
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 ( #7453 )
2024-08-16 02:38:08 +00:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. ( #7343 )
2024-08-15 19:34:28 -07:00
Michael Goin
9c8e2d1161
[Bugfix][Harmless] Fix float16 dtype for model_is_embedding ( #7566 )
2024-08-15 18:26:19 -07:00
Michael Goin
21313e09e3
[Bugfix] Fix default weight loading for scalars ( #7534 )
2024-08-15 13:10:22 -07:00
PHILO-HE
f4da5f7b6d
[Misc] Update dockerfile for CPU to cover protobuf installation ( #7182 )
2024-08-15 10:03:01 -07:00
omrishiv
9c1f78d5d6
[Bugfix] update neuron for version > 0.5.0 ( #7175 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-15 09:44:14 -07:00
Woosuk Kwon
fc93e56143
[Bugfix][TPU] Correct env variable for XLA cache path ( #7544 )
2024-08-15 00:02:29 -07:00
Kameshwara Pavan Kumar Mantha
22b39e11f2
llama_index serving integration documentation ( #6973 )
...
Co-authored-by: pavanmantha <pavan.mantha@thevaslabs.io >
2024-08-14 15:38:37 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
Woosuk Kwon
951fdd66d3
[TPU] Set per-rank XLA cache ( #7533 )
2024-08-14 14:47:51 -07:00
William Lin
2ecf7b1757
[core] [3/N] multi-step args and sequence.py ( #7452 )
2024-08-14 12:32:45 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt ( #7126 )
2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray ( #7424 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-14 09:44:27 -07:00
jack
67d115db08
[Bugfix][Frontend] Disable embedding API for chat models ( #7504 )
...
Co-authored-by: jack <jack@alex>
2024-08-14 09:15:19 -07:00
youkaichao
d3d9cb6e4b
[ci] fix model tests ( #7507 )
2024-08-14 01:01:43 -07:00
Chang Su
c134a46402
Fix empty output when temp is too low ( #2937 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-08-14 05:31:44 +00:00
youkaichao
199adbb7cf
[doc] update test script to include cudagraph ( #7501 )
2024-08-13 21:52:58 -07:00
Cyrus Leung
dd164d72f3
[Bugfix][Docs] Update list of mock imports ( #7493 )
2024-08-13 20:37:30 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins ( #7489 )
2024-08-13 19:27:46 -07:00
Jee Jee Li
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests ( #7396 )
2024-08-13 17:27:29 -07:00
Woosuk Kwon
59edd0f134
[Bugfix][CI] Import ray under guard ( #7486 )
2024-08-13 17:12:58 -07:00
Woosuk Kwon
a08df8322e
[TPU] Support multi-host inference ( #7457 )
2024-08-13 16:31:20 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation ( #7426 )
2024-08-13 16:24:17 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process ( #7484 )
2024-08-13 15:40:17 -07:00
Simon Mo
c5c7768264
Announce NVIDIA Meetup ( #7483 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-13 14:28:36 -07:00
Dipika Sikka
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters ( #7422 )
2024-08-13 17:08:20 -04:00
Dipika Sikka
d3bdfd3ab9
[Misc] Update Fused MoE weight loading ( #7334 )
2024-08-13 14:57:45 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters ( #7281 )
2024-08-13 14:30:11 -04:00
Dipika Sikka
181abbc27d
[Misc] Update LM Eval Tolerance ( #7473 )
2024-08-13 14:28:14 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models ( #7446 )
2024-08-13 17:39:33 +00:00
Woosuk Kwon
e20233d361
Revert "[Doc] Update supported_hardware.rst ( #7276 )" ( #7467 )
2024-08-13 01:37:08 -07:00
Woosuk Kwon
d6e634f3d7
[TPU] Suppress import custom_ops warning ( #7458 )
2024-08-13 00:30:30 -07:00
youkaichao
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() ( #7102 )
2024-08-13 00:16:42 -07:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
Kevin H. Luu
5469146bcc
[ci] Remove fast check cancel workflow ( #7455 )
2024-08-12 21:19:51 -07:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message ( #7435 )
2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements ( #7417 )
2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit ( #7224 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-12 17:57:16 -07:00
Daniele
774cd1d3bf
[CI/Build] bump minimum cmake version ( #6999 )
2024-08-12 16:29:20 -07:00
sasha0552
91294d56e1
[Bugfix] Handle PackageNotFoundError when checking for xpu version ( #7398 )
2024-08-12 16:07:20 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel ( #7208 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-12 22:47:41 +00:00
Cyrus Leung
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments ( #7416 )
2024-08-12 14:14:14 -07:00
Lucas Wilkinson
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels ( #7323 )
2024-08-12 14:40:13 -04:00
Kevin H. Luu
1137f343aa
[ci] Cancel fastcheck when PR is ready ( #7433 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:59:14 -07:00
Kevin H. Luu
9b3e2edd30
[ci] Cancel fastcheck run when PR is marked ready ( #7427 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:56:52 -07:00
Kevin H. Luu
65950e8f58
[ci] Entrypoints run upon changes in vllm/ ( #7423 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:18:03 -07:00
Woosuk Kwon
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend ( #7425 )
2024-08-12 09:58:28 -07:00
Daniele
d2bc4510a4
[CI/Build] bump Dockerfile.neuron image base, use public ECR ( #6832 )
2024-08-12 09:53:35 -07:00
Cyrus Leung
24154f8618
[Frontend] Disallow passing model as both argument and option ( #7347 )
2024-08-12 12:58:34 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input ( #6613 )
2024-08-12 16:16:06 +08:00
Lily Liu
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 ( #7319 )
2024-08-12 07:59:17 +00:00
Roger Wang
86ab567bae
[CI/Build] Minor refactoring for vLLM assets ( #7407 )
2024-08-12 02:41:52 +00:00
Simon Mo
f020a6297e
[Docs] Update readme ( #7316 )
2024-08-11 17:13:37 -07:00
youkaichao
6c8e595710
[misc] add commit id in collect env ( #7405 )
2024-08-11 15:40:48 -07:00
tomeras91
02b1988b9f
[Doc] building vLLM with VLLM_TARGET_DEVICE=empty ( #7403 )
2024-08-11 14:38:17 -07:00
tomeras91
386087970a
[CI/Build] build on empty device for better dev experience ( #4773 )
2024-08-11 13:09:44 -07:00
William Lin
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step ( #7387 )
2024-08-11 08:50:08 -07:00
Noam Gat
4fb7b52a2c
Updating LM Format Enforcer version to v0.10.6 ( #7189 )
2024-08-11 08:11:50 -04:00
Woosuk Kwon
90bab18f24
[TPU] Use mark_dynamic to reduce compilation time ( #7340 )
2024-08-10 18:12:22 -07:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio ( #7392 )
2024-08-10 16:19:33 +00:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 ( #7380 )
2024-08-09 23:48:49 +00:00
Antoni Baum
999ef0b917
[Misc] Add numpy implementation of compute_slot_mapping ( #7377 )
2024-08-09 22:52:29 +00:00
Dipika Sikka
5c6c54d67a
[Bugfix] Fix PerTensorScaleParameter weight loading for fused models ( #7376 )
2024-08-09 21:23:46 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
Roger Wang
70d268a399
[Bugfix] Fix ITL recording in serving benchmark ( #7372 )
2024-08-09 10:00:00 -07:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API ( #7132 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Alexander Matveev
74af2bbd90
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder ( #7360 )
2024-08-09 16:35:49 +00:00
Alexander Matveev
fc7b8d1eef
[Performance] e2e overheads reduction: Small followup diff ( #7364 )
2024-08-09 15:49:36 +00:00
Isotr0py
67abdbb42f
[VLM][Doc] Add stop_token_ids to InternVL example ( #7354 )
2024-08-09 14:51:04 +00:00
Mor Zusman
07ab160741
[Model][Jamba] Mamba cache single buffer ( #6739 )
...
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-08-09 10:07:06 -04:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine ( #7336 )
2024-08-09 07:06:36 +00:00
William Lin
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace ( #6971 )
2024-08-09 05:42:45 +00:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary ( #7218 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Alexander Matveev
e02ac55617
[Performance] Optimize e2e overheads: Reduce python allocations ( #7162 )
2024-08-08 21:34:28 -07:00
Woosuk Kwon
73388c07a4
[TPU] Fix dockerfile.tpu ( #7331 )
2024-08-08 20:24:58 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
Siyuan Liu
0fa14907da
[TPU] Add Load-time W8A16 quantization for TPU Backend ( #7005 )
2024-08-08 18:35:49 -07:00
Simon Mo
5923532e15
Add Skywork AI as Sponsor ( #7314 )
2024-08-08 13:59:57 -07:00
Jee Jee Li
a049b107e2
[Misc] Temporarily resolve the error of BitAndBytes ( #7308 )
2024-08-08 13:42:58 -07:00
Isotr0py
8334c39f37
[Bugfix] Fix new Llama3.1 GGUF model loading ( #7269 )
2024-08-08 13:42:44 -07:00
Daniele
e904576743
[CI/Build] Dockerfile.cpu improvements ( #7298 )
2024-08-08 15:24:52 -04:00
Michael Goin
e14fb22e59
[Doc] Put collect_env issue output in a <detail> block ( #7310 )
2024-08-08 11:22:49 -07:00
Zach Zheng
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic ( #6849 )
2024-08-08 10:43:30 -07:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death ( #6594 )
...
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Luka Govedič
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests ( #7305 )
2024-08-08 12:16:13 -04:00
Jee Jee Li
757ac70a64
[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 ( #7273 )
2024-08-08 14:02:41 +00:00
Murali Andoorveedu
6dffa4b0a6
[Bugfix] Fix LoRA with PP ( #7292 )
2024-08-08 00:02:27 -07:00
Cherilyn Buren
48abee9e54
[Frontend] remove max_num_batched_tokens limit for lora ( #7288 )
2024-08-08 06:17:29 +00:00
Rui Qiao
746709642c
[Misc] Fix typos in scheduler.py ( #7285 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-07 17:06:01 -07:00
Lily Liu
e53dfd3eaf
[Kernel] Fix Flashinfer Correctness ( #7284 )
2024-08-07 16:26:52 -07:00
Michael Goin
6d94420246
[Doc] Update supported_hardware.rst ( #7276 )
2024-08-07 14:21:50 -07:00
Nick Hill
fc1493a01e
[FrontEnd] Make merge_async_iterators is_cancelled arg optional ( #7282 )
2024-08-07 13:35:14 -07:00
Lucas Wilkinson
311f743831
[Bugfix] Fix gptq failure on T4s ( #7264 )
2024-08-07 20:05:37 +00:00
Kevin H. Luu
469b3bc538
[ci] Make building wheels per commit optional ( #7278 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-07 11:34:25 -07:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
Maximilien de Bayser
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang ( #7217 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-08-07 18:09:36 +00:00
Stas Bekman
0e12cd67a8
[Doc] add online speculative decoding example ( #7243 )
2024-08-07 09:58:02 -07:00
Ilya Lavrenov
80cbe10c59
[OpenVINO] migrate to latest dependencies versions ( #7251 )
2024-08-07 09:49:10 -07:00
Isotr0py
b764547616
[Bugfix] Fix input processor for InternVL2 model ( #7164 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-07 09:32:07 -07:00
Rafael Vasquez
ab0f5e2823
Fixes typo in function name ( #7275 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-08-07 09:29:27 -07:00
Robert Shaw
564985729a
[ BugFix ] Move zmq frontend to IPC instead of TCP ( #7222 )
2024-08-07 16:24:56 +00:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 ( #5874 )
2024-08-07 09:17:58 -07:00
youkaichao
639159b2a6
[distributed][misc] add specialized method for cuda platform ( #7249 )
2024-08-07 08:54:52 -07:00
Cyrus Leung
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure ( #7238 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-07 09:12:05 +00:00
Atilla Akkuş
7b261092de
[BUGFIX]: top_k is expected to be an integer. ( #7227 )
2024-08-07 00:32:16 -07:00
Roger Wang
2385c8f374
[Doc] Mock new dependencies for documentation ( #7245 )
2024-08-07 06:43:03 +00:00
Nick Hill
9a3f49ae07
[BugFix] Overhaul async request cancellation ( #7111 )
2024-08-07 13:21:41 +08:00
Michael Goin
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading ( #7225 )
2024-08-06 18:34:26 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
xiaobochen123
660470e5a3
[Core] Optimize evictor-v2 performance ( #7193 )
2024-08-06 12:34:25 -07:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lily Liu
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests ( #7183 )
2024-08-06 10:40:32 -07:00
Katarzyna Papis
00afc78590
[Bugfix] add gguf dependency ( #7198 )
...
Co-authored-by: katarzyna.papis <kpapis@kpapis-u20.sclab.intel.com>
2024-08-06 10:08:35 -07:00
Robert Shaw
541c1852d3
[ BugFix ] Fix ZMQ when VLLM_PORT is set ( #7205 )
2024-08-06 09:26:26 -07:00
Dipika Sikka
a3bbbfa1d8
[BugFix] Fix DeepSeek remote code ( #7178 )
2024-08-06 08:16:53 -07:00
Cyrus Leung
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models ( #7153 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-08-06 16:55:31 +08:00
Jee Jee Li
9118217f58
[LoRA] Relax LoRA condition ( #7146 )
2024-08-06 01:57:25 +00:00
Simon Mo
e3c664bfcb
[Build] Add initial conditional testing spec ( #6841 )
2024-08-05 17:39:22 -07:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Cody Yu
ef527be06c
[MISC] Use non-blocking transfer in prepare_input ( #7172 )
2024-08-05 23:41:27 +00:00
Jacob Schein
89b8db6bb2
[Bugfix] Specify device when loading LoRA and embedding tensors ( #7129 )
...
Co-authored-by: Jacob Schein <jacobschein@Jacobs-MacBook-Pro-2.local>
2024-08-05 16:35:47 -07:00
Thomas Parnell
789937af2e
[Doc] [SpecDecode] Update MLPSpeculator documentation ( #7100 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-08-05 23:29:43 +00:00
youkaichao
dfb1a15dcb
[ci][frontend] deduplicate tests ( #7101 )
2024-08-05 15:59:22 -07:00