Commit Graph

5253 Commits

Author SHA1 Message Date
Woosuk Kwon
54cf1cae62 [Misc] Do not print async output warning for v1 (#21151)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-17 21:57:02 -07:00
Shu Wang
c7d8724e78 [Core] FlashInfer CUTLASS fused MoE backend (NVFP4) (#20037)
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-17 21:32:45 -07:00
Lucas Wilkinson
89cab4d01f [Attention] Make local attention backend agnostic (#21093) 2025-07-18 00:10:42 -04:00
elvischenv
8dfb45ca33 [Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133) 2025-07-18 00:35:58 +00:00
Wentao Ye
8a8fc94639 [Log] Debugging Log with more Information (#20770)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-18 00:19:46 +00:00
Woosuk Kwon
4de7146351 [V0 deprecation] Remove V0 HPU backend (#21131)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-17 16:37:36 -07:00
Eric Curtin
ac9fb732a5 On environments where numa cannot be detected we get 0 (#21115)
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-07-17 18:52:17 +00:00
Jee Jee Li
a3a6c695f4 [Misc] Qwen MoE model supports LoRA (#20932)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-17 18:32:52 +00:00
Cyrus Leung
90bd2ab6e3 [Model] Update pooling model interface (#21058)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-17 16:05:40 +00:00
ElizaWszola
9fb2d22032 [Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
2025-07-17 09:56:44 -04:00
wangxiyuan
89e3c4e9b4 [Misc] Avoid unnecessary import (#21106)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-17 12:57:41 +00:00
Harry Mellor
fe8a2c544a [Docs] Improve docstring formatting for FusedMoEParallelConfig.make (#21117)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-17 04:13:00 -07:00
kYLe
4ef00b5cac [VLM] Add Nemotron-Nano-VL-8B-V1 support (#20349)
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-17 03:07:55 -07:00
Asher
5a7fb3ab9e [Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820)
Signed-off-by: Asher Zhang <asherszhang@tencent.com>
2025-07-17 09:10:09 +00:00
Varun Sundar Rabindranath
11dfdf21bf [Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-17 08:10:37 +00:00
Chauncey
fdc5b43d20 [Bugfix]: Fix final_res_batch list index out of range error (#21055)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-17 00:29:09 -07:00
Jee Jee Li
c5b8b5953a [Misc] Fix PhiMoE expert mapping (#21085)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-17 05:47:49 +00:00
David Ben-David
4fcef49ec4 [V1] [KVConnector] Fix MultiprocExecutor worker output aggregation (#21048)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
2025-07-17 13:29:45 +08:00
Zhonghua Deng
8a4e5c5f3c [V1][P/D]Enhance Performance and code readability for P2pNcclConnector (#20906)
Signed-off-by: Abatom <abzhonghua@gmail.com>
2025-07-16 22:13:00 -07:00
Lucas Wilkinson
76b494444f [Attention] Refactor attention metadata builder interface (#20466)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-17 04:44:25 +00:00
Michael Goin
28a6d5423d [Bugfix] Fix Machete zero point issue for GPTQ models on SM90 (#21066)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-16 19:54:45 -07:00
Kevin_Xiong
c9ba8104ed [Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group (#21024)
Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com>
2025-07-16 19:36:36 -07:00
QiliangCui
72ad273582 Remove torch_xla.tpu.version() from pallas.py. (#21065)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-07-17 00:25:26 +00:00
Nir David
01513a334a Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010)
Signed-off-by: Nir David <ndavid@habana.ai>
Signed-off-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
2025-07-16 15:33:41 -04:00
Cyrus Leung
ac2bf41e53 [Model] Remove model sampler (#21059)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-16 19:03:37 +00:00
Harry Mellor
a931b4cdcf Remove Qwen Omni workaround that's no longer necessary (#21057)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-16 16:25:23 +00:00
Avshalom Manevich
a0f8a79646 [fix] fix qwen image_embeds input (#21049)
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai>
2025-07-16 15:17:20 +00:00
Mac Misiura
18bdcf4113 feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information (#20575)
Signed-off-by: m-misiura <mmisiura@redhat.com>
2025-07-16 21:52:14 +08:00
Cyrus Leung
1c3198b6c4 [Model] Consolidate pooler implementations (#20927)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-16 13:39:13 +00:00
Chengji Yao
85431bd9ad [TPU] fix kv_cache_update kernel block size choosing logic (#21007)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-16 04:39:48 +00:00
zhiweiz
c11013db8b [Meta] Llama4 EAGLE Support (#20591)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: qizixi <qizixi@meta.com>
2025-07-15 21:14:15 -07:00
Peter Pan
1eb2b9c102 [CI] update typos config for CI pre-commit and fix some spells (#20919)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-07-15 21:12:40 -07:00
Wentao Ye
76ddeff293 [Doc] Remove duplicate docstring (#21012)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-15 20:09:13 -07:00
Michael Goin
f46098335b [Bugfix] Fix Mistral3 support on SM100/SM120 (#20998)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-15 20:08:41 -07:00
Ming Yang
fcb9f879c1 [Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… (#20937)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-15 19:53:42 -07:00
Reid
fa839565f2 [Misc] Refactor: Improve argument handling for conda command (#20481)
Signed-off-by: reidliu41 <reid201711@gmail.com>
2025-07-15 19:43:19 -07:00
Brayden Zhong
75a99b98bf [Chore] Remove outdated transformers check (#20989)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-07-15 19:42:40 -07:00
Thomas Parnell
6cbc4d4bea [Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture (#20923)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-15 19:19:10 -07:00
Michael Goin
153c6f1e61 [Frontend] Remove print left in FrontendArgs.add_cli_args (#21004)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-15 19:18:41 -07:00
Chauncey
34cda778a0 [Frontend] OpenAI Responses API supports input image (#20975)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-15 18:59:36 -06:00
Elfie Guo
30800b01c2 [Nvidia] Integrate SM100 cudnn prefill API to MLA prefill (#20411)
Signed-off-by: Elfie Guo <elfieg@nvidia.com>
Co-authored-by: Elfie Guo <eflieg@nvidia.com>
2025-07-15 17:56:45 -07:00
Chen LI
10be209493 [Bug Fix] get_distributed_init_method should get the ip from get_ip i… (#20889)
Signed-off-by: Chen Li <lcpingping@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-07-15 21:23:52 +00:00
Marko Rosenmueller
19c863068b [Frontend] Support cache_salt in /v1/completions and /v1/responses (#20981)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-07-15 21:01:04 +00:00
Tuan, Hoang-Trong
f29fd8a7f8 [BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 (#20838)
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
2025-07-15 16:08:26 -04:00
Harry Mellor
b637e9dcb8 Add full serve CLI reference back to docs (#20978)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 17:42:30 +00:00
Harry Mellor
1e36c8687e [Deprecation] Remove nullable_kvs (#20969)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 17:21:50 +00:00
Harry Mellor
313ae8c16a [Deprecation] Remove everything scheduled for removal in v0.10.0 (#20979)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 15:57:53 +00:00
Patrick von Platen
e7e3e6d263 Voxtral (#20970)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-15 07:35:30 -07:00
Christian Pinto
4ffd963fa0 [v1][core] Support for attention free models (#20811)
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
2025-07-15 14:20:01 +00:00
Harry Mellor
56fe4bedd6 [Deprecation] Remove TokenizerPoolConfig (#20968)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 14:00:50 +00:00