Richard Zou
|
7f58fb9718
|
Add assertion for no objects while hashing hf_config (#16930)
Signed-off-by: rzou <zou3519@gmail.com>
|
2025-04-22 09:32:22 -07:00 |
|
vllmellm
|
30bc3e0f66
|
[FEAT][ROCm]: Support AITER MLA (#15893)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: qli88 <qiang.li2@amd.com>
|
2025-04-22 09:31:13 -07:00 |
|
Reid
|
f34410715f
|
[frontend] enhance tool_calls type check (#16882)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
|
2025-04-22 15:40:24 +00:00 |
|
Chauncey
|
68d4c33202
|
[Misc] Add S3 environment variables for better support of MinIO. (#16977)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2025-04-22 14:27:36 +00:00 |
|
Zhengyuan Su (苏政渊)
|
f961d7f6ef
|
[BugFix] Pass in correct VLLM config in FlashInfer backend (#13207) (#16973)
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn>
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn>
|
2025-04-22 06:44:10 -07:00 |
|
Harry Mellor
|
d059110498
|
Improve configs - SpeculativeConfig (#16971)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-04-22 12:55:36 +00:00 |
|
Yang Fan
|
571e8dd65e
|
[Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni (#16974)
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com>
|
2025-04-22 12:23:17 +00:00 |
|
Reid
|
4b91c927f6
|
[Misc] refactor example series (#16972)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
|
2025-04-22 11:44:21 +00:00 |
|
vllmellm
|
0e237f0035
|
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER (#15001)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2025-04-22 02:46:28 -07:00 |
|
Cyrus Leung
|
8f7bace7c3
|
[Doc] Improve documentation for multimodal CLI args (#16960)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-04-22 08:35:35 +00:00 |
|
Nick Hill
|
e4d6144232
|
[BugFix] Fix incremental detokenization perf issue (#16963)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-04-22 08:16:19 +00:00 |
|
Lei Wang
|
8d32dc603d
|
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
Co-authored-by: xinyuxiao <xinyuxiao2024@gmail.com>
|
2025-04-22 09:01:36 +01:00 |
|
Woosuk Kwon
|
c4ab9f3e71
|
[V1] Remove pre-allocation for KV cache (#16941)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-04-22 00:52:18 -07:00 |
|
Flora Feng
|
2689d5c027
|
[Model] Use autoweightloader for mamba (#16950)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
|
2025-04-22 07:48:15 +00:00 |
|
Chauncey
|
acba33a0f1
|
[Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams (#16767)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
|
2025-04-22 06:02:20 +00:00 |
|
SnowCharm
|
a114bf20a3
|
[Perf] Optimize _update_states for GPU model runner (#16910)
Signed-off-by: snowcharm <snowcharmqq@gmail.com>
|
2025-04-22 14:01:54 +08:00 |
|
Michael Yao
|
3097ce3a32
|
[Doc] Update ai_accelerator/hpu-gaudi.inc.md (#16956)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
|
2025-04-22 05:33:27 +00:00 |
|
Cyrus Leung
|
d6da9322c8
|
[Bugfix] Fix f-string for Python 3.9-3.11 (#16962)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-04-21 21:45:55 -07:00 |
|
omer-dayan
|
71ce44047f
|
Support S3 Sharded loading with RunAI Model Streamer (#16317)
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2025-04-21 21:21:49 -07:00 |
|
Charlie Fu
|
188b7f9b8c
|
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (#15830)
Signed-off-by: charlifu <charlifu@amd.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
|
2025-04-21 20:46:22 -07:00 |
|
wangxiyuan
|
b9b4746950
|
[V1] Remove additional_config check (#16710)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2025-04-21 20:45:27 -07:00 |
|
Varun Sundar Rabindranath
|
7b8a2ab76f
|
[Kernel] Add expert_map support to Cutlass FP8 MOE (#16861)
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com>
|
2025-04-21 20:44:32 -07:00 |
|
Jee Jee Li
|
c9acbf1141
|
[Misc] Remove the chunked prefill warning for LoRA (#16925)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-04-21 20:44:24 -07:00 |
|
kliuae
|
5b794cae8d
|
[ROCm] Add aiter tkw1 kernel for Llama4 fp8 (#16727)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
|
2025-04-21 20:42:34 -07:00 |
|
Jeffrey Li
|
0e4254492f
|
[Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other (#16863)
Signed-off-by: Jeffrey Li <jeffrey.dot.li@gmail.com>
|
2025-04-22 11:40:19 +08:00 |
|
Woosuk Kwon
|
1311913f55
|
[BugFix][Spec Decode] No in-place update to draft probs (#16952)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-04-21 19:54:19 -07:00 |
|
Cyrus Leung
|
29f395c97c
|
[Doc] Remove unnecessary V1 flag (#16924)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-04-21 21:04:38 -04:00 |
|
Nicolò Lucchesi
|
fa3bba2a53
|
[TPU][V1] Enable Top-P (#16843)
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-04-22 00:46:07 +00:00 |
|
Michael Goin
|
986537f1c3
|
[V1] V1 FlashInfer Attention (#16684)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
|
2025-04-22 00:38:41 +00:00 |
|
Nicolò Lucchesi
|
210207525e
|
[TPU][V1] Capture multimodal encoder during model compilation (#15051)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
|
2025-04-21 18:36:59 -06:00 |
|
Michael Goin
|
71eda0bb76
|
Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml (#16946)
|
2025-04-21 18:35:32 -06:00 |
|
Chengji Yao
|
471fe65630
|
[TPU][V1] Implicitly adjust page size when there's SMEM OOM (#16871)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
|
2025-04-21 15:43:13 -06:00 |
|
Woosuk Kwon
|
3a0fba5cf4
|
[V1][Spec Decode] Handle draft tokens beyond max_model_len (#16087)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-04-21 12:38:50 -07:00 |
|
Chanh Nguyen
|
299ebb62b2
|
[Core] Speed up decode by remove synchronizing operation in sampler (#16436)
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com>
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com>
|
2025-04-21 18:18:22 +00:00 |
|
David Xia
|
f728ab8e35
|
[Doc] mention how to install in CPU editable mode (#16923)
Signed-off-by: David Xia <david@davidxia.com>
|
2025-04-21 17:45:51 +00:00 |
|
David Xia
|
63e26fff78
|
[doc] install required python3-dev apt package (#16888)
Signed-off-by: David Xia <david@davidxia.com>
|
2025-04-21 16:15:18 +00:00 |
|
Yan Ma
|
fe3462c774
|
[XPU][Bugfix] minor fix for XPU (#15591)
Signed-off-by: yan ma <yan.ma@intel.com>
|
2025-04-22 00:02:57 +08:00 |
|
Kartik Ramesh
|
3b34fd5273
|
Raise error for data-parallel with benchmark_throughput (#16737)
Signed-off-by: Kartik Ramesh <kartikx2000@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2025-04-21 23:51:43 +08:00 |
|
Isotr0py
|
55d6d3fdb8
|
[Bugfix] Fix GLM rotary_dim issue and support v1 (#16912)
Signed-off-by: isotr0py <2037008807@qq.com>
|
2025-04-21 14:26:34 +00:00 |
|
Shanshan Shen
|
7272bfae77
|
[Misc] Refactor platform to get device specific stream and event (#14411)
Signed-off-by: shen-shanshan <467638484@qq.com>
|
2025-04-21 21:25:49 +08:00 |
|
wangxiyuan
|
d9ac9e3dc5
|
[Misc] fix collect_env version parse (#15267)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2025-04-21 20:29:40 +08:00 |
|
Han Zhang
|
d41faaf9df
|
Restore buffers when wake up from level 2 sleep (#16564) (#16889)
Signed-off-by: Han <zh950713@gmail.com>
|
2025-04-21 20:18:28 +08:00 |
|
Alex Brooks
|
b34f33438a
|
[Doc] Split dummy_processor_inputs() in Multimodal Docs (#16915)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
|
2025-04-21 11:10:01 +00:00 |
|
Yang Fan
|
26c0406555
|
[Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni (#16907)
|
2025-04-21 10:25:21 +00:00 |
|
Woosuk Kwon
|
4c41278b77
|
[CI/CD][V1] Add spec decode tests to CI (#16900)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-04-20 22:37:16 -07:00 |
|
qizixi
|
bb3605db85
|
[Bugfix] Fix v1/spec_decode/test_ngram.py (#16895)
Signed-off-by: qizixi <qizixi@meta.com>
|
2025-04-20 20:54:29 -07:00 |
|
Richard Zou
|
fe742aef5a
|
[easy] Pass compile_fx only the config patches (#16845)
Signed-off-by: rzou <zou3519@gmail.com>
|
2025-04-20 12:25:19 +08:00 |
|
Harry Mellor
|
4b07d36891
|
Improve configs - CacheConfig (#16835)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-04-20 12:25:04 +08:00 |
|
Staszek Paśko
|
87aaadef73
|
Serialize tensors using int8 views (#16866)
Signed-off-by: Staszek Pasko <staszek@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2025-04-19 10:28:34 -07:00 |
|
Richard Zou
|
682e0b6d2f
|
Log how much time loading a compiled artifact takes (#16848)
Signed-off-by: rzou <zou3519@gmail.com>
|
2025-04-19 16:50:46 +00:00 |
|