Woosuk Kwon
|
54cf1cae62
|
[Misc] Do not print async output warning for v1 (#21151)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-07-17 21:57:02 -07:00 |
|
Shu Wang
|
c7d8724e78
|
[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) (#20037)
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-07-17 21:32:45 -07:00 |
|
Lucas Wilkinson
|
89cab4d01f
|
[Attention] Make local attention backend agnostic (#21093)
|
2025-07-18 00:10:42 -04:00 |
|
elvischenv
|
8dfb45ca33
|
[Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133)
|
2025-07-18 00:35:58 +00:00 |
|
Wentao Ye
|
8a8fc94639
|
[Log] Debugging Log with more Information (#20770)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-07-18 00:19:46 +00:00 |
|
Woosuk Kwon
|
4de7146351
|
[V0 deprecation] Remove V0 HPU backend (#21131)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-07-17 16:37:36 -07:00 |
|
Eric Curtin
|
ac9fb732a5
|
On environments where numa cannot be detected we get 0 (#21115)
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
|
2025-07-17 18:52:17 +00:00 |
|
Jee Jee Li
|
a3a6c695f4
|
[Misc] Qwen MoE model supports LoRA (#20932)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-07-17 18:32:52 +00:00 |
|
Cyrus Leung
|
90bd2ab6e3
|
[Model] Update pooling model interface (#21058)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-07-17 16:05:40 +00:00 |
|
ElizaWszola
|
9fb2d22032
|
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
|
2025-07-17 09:56:44 -04:00 |
|
wangxiyuan
|
89e3c4e9b4
|
[Misc] Avoid unnecessary import (#21106)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2025-07-17 12:57:41 +00:00 |
|
Harry Mellor
|
fe8a2c544a
|
[Docs] Improve docstring formatting for FusedMoEParallelConfig.make (#21117)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-07-17 04:13:00 -07:00 |
|
kYLe
|
4ef00b5cac
|
[VLM] Add Nemotron-Nano-VL-8B-V1 support (#20349)
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2025-07-17 03:07:55 -07:00 |
|
Asher
|
5a7fb3ab9e
|
[Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820)
Signed-off-by: Asher Zhang <asherszhang@tencent.com>
|
2025-07-17 09:10:09 +00:00 |
|
Varun Sundar Rabindranath
|
11dfdf21bf
|
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
|
2025-07-17 08:10:37 +00:00 |
|
Chauncey
|
fdc5b43d20
|
[Bugfix]: Fix final_res_batch list index out of range error (#21055)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2025-07-17 00:29:09 -07:00 |
|
Jee Jee Li
|
c5b8b5953a
|
[Misc] Fix PhiMoE expert mapping (#21085)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-07-17 05:47:49 +00:00 |
|
David Ben-David
|
4fcef49ec4
|
[V1] [KVConnector] Fix MultiprocExecutor worker output aggregation (#21048)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
|
2025-07-17 13:29:45 +08:00 |
|
Zhonghua Deng
|
8a4e5c5f3c
|
[V1][P/D]Enhance Performance and code readability for P2pNcclConnector (#20906)
Signed-off-by: Abatom <abzhonghua@gmail.com>
|
2025-07-16 22:13:00 -07:00 |
|
Lucas Wilkinson
|
76b494444f
|
[Attention] Refactor attention metadata builder interface (#20466)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2025-07-17 04:44:25 +00:00 |
|
Michael Goin
|
28a6d5423d
|
[Bugfix] Fix Machete zero point issue for GPTQ models on SM90 (#21066)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-07-16 19:54:45 -07:00 |
|
Kevin_Xiong
|
c9ba8104ed
|
[Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group (#21024)
Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com>
|
2025-07-16 19:36:36 -07:00 |
|
QiliangCui
|
72ad273582
|
Remove torch_xla.tpu.version() from pallas.py. (#21065)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
|
2025-07-17 00:25:26 +00:00 |
|
Nir David
|
01513a334a
|
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010)
Signed-off-by: Nir David <ndavid@habana.ai>
Signed-off-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
|
2025-07-16 15:33:41 -04:00 |
|
Cyrus Leung
|
ac2bf41e53
|
[Model] Remove model sampler (#21059)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-07-16 19:03:37 +00:00 |
|
Harry Mellor
|
a931b4cdcf
|
Remove Qwen Omni workaround that's no longer necessary (#21057)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-07-16 16:25:23 +00:00 |
|
Avshalom Manevich
|
a0f8a79646
|
[fix] fix qwen image_embeds input (#21049)
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai>
|
2025-07-16 15:17:20 +00:00 |
|
Mac Misiura
|
18bdcf4113
|
feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information (#20575)
Signed-off-by: m-misiura <mmisiura@redhat.com>
|
2025-07-16 21:52:14 +08:00 |
|
Cyrus Leung
|
1c3198b6c4
|
[Model] Consolidate pooler implementations (#20927)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-07-16 13:39:13 +00:00 |
|
Chengji Yao
|
85431bd9ad
|
[TPU] fix kv_cache_update kernel block size choosing logic (#21007)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
|
2025-07-16 04:39:48 +00:00 |
|
zhiweiz
|
c11013db8b
|
[Meta] Llama4 EAGLE Support (#20591)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: qizixi <qizixi@meta.com>
|
2025-07-15 21:14:15 -07:00 |
|
Peter Pan
|
1eb2b9c102
|
[CI] update typos config for CI pre-commit and fix some spells (#20919)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
|
2025-07-15 21:12:40 -07:00 |
|
Wentao Ye
|
76ddeff293
|
[Doc] Remove duplicate docstring (#21012)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-07-15 20:09:13 -07:00 |
|
Michael Goin
|
f46098335b
|
[Bugfix] Fix Mistral3 support on SM100/SM120 (#20998)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-07-15 20:08:41 -07:00 |
|
Ming Yang
|
fcb9f879c1
|
[Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… (#20937)
Signed-off-by: Ming Yang <minos.future@gmail.com>
|
2025-07-15 19:53:42 -07:00 |
|
Reid
|
fa839565f2
|
[Misc] Refactor: Improve argument handling for conda command (#20481)
Signed-off-by: reidliu41 <reid201711@gmail.com>
|
2025-07-15 19:43:19 -07:00 |
|
Brayden Zhong
|
75a99b98bf
|
[Chore] Remove outdated transformers check (#20989)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-07-15 19:42:40 -07:00 |
|
Thomas Parnell
|
6cbc4d4bea
|
[Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture (#20923)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
|
2025-07-15 19:19:10 -07:00 |
|
Michael Goin
|
153c6f1e61
|
[Frontend] Remove print left in FrontendArgs.add_cli_args (#21004)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-07-15 19:18:41 -07:00 |
|
Chauncey
|
34cda778a0
|
[Frontend] OpenAI Responses API supports input image (#20975)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2025-07-15 18:59:36 -06:00 |
|
Elfie Guo
|
30800b01c2
|
[Nvidia] Integrate SM100 cudnn prefill API to MLA prefill (#20411)
Signed-off-by: Elfie Guo <elfieg@nvidia.com>
Co-authored-by: Elfie Guo <eflieg@nvidia.com>
|
2025-07-15 17:56:45 -07:00 |
|
Chen LI
|
10be209493
|
[Bug Fix] get_distributed_init_method should get the ip from get_ip i… (#20889)
Signed-off-by: Chen Li <lcpingping@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-07-15 21:23:52 +00:00 |
|
Marko Rosenmueller
|
19c863068b
|
[Frontend] Support cache_salt in /v1/completions and /v1/responses (#20981)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
|
2025-07-15 21:01:04 +00:00 |
|
Tuan, Hoang-Trong
|
f29fd8a7f8
|
[BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 (#20838)
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
|
2025-07-15 16:08:26 -04:00 |
|
Harry Mellor
|
b637e9dcb8
|
Add full serve CLI reference back to docs (#20978)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-07-15 17:42:30 +00:00 |
|
Harry Mellor
|
1e36c8687e
|
[Deprecation] Remove nullable_kvs (#20969)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-07-15 17:21:50 +00:00 |
|
Harry Mellor
|
313ae8c16a
|
[Deprecation] Remove everything scheduled for removal in v0.10.0 (#20979)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-07-15 15:57:53 +00:00 |
|
Patrick von Platen
|
e7e3e6d263
|
Voxtral (#20970)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2025-07-15 07:35:30 -07:00 |
|
Christian Pinto
|
4ffd963fa0
|
[v1][core] Support for attention free models (#20811)
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
|
2025-07-15 14:20:01 +00:00 |
|
Harry Mellor
|
56fe4bedd6
|
[Deprecation] Remove TokenizerPoolConfig (#20968)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-07-15 14:00:50 +00:00 |
|