Tao He | e93f4cc9e3 | Add the support for the qwen3 next model (a hybrid attention model). (#24526) | 2025-09-11 15:32:09 +08:00
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

Didier Durand | e2b1f863aa | [Doc]: fixing doc typos (#24635) | 2025-09-10 23:19:28 -07:00
Signed-off-by: Didier Durand <durand.didier@gmail.com>

Hanjie Qiu | dcb28a332b | [Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration (#21078) | 2025-09-10 15:31:10 -07:00
Signed-off-by: hjjq <hanjieq@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Michael Goin | fba7856581 | [Perf] Warmup FlashInfer attention during startup (#23439) | 2025-09-10 15:03:17 -07:00
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com>

Chen Zhang | b5e383cd8b | [gpt-oss] raise error for flashinfer backend without trtllm (#24482) | 2025-09-10 14:33:13 -07:00
Signed-off-by: Chen Zhang <zhangch99@outlook.com>

Gregory Shtrasberg | 9a161307f5 | [torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends (#19767) | 2025-09-10 13:59:55 -07:00
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

Russell Bryant | 37e8182bfe | [v1] Add Whisper model support (encoder-decoder) (#21088) | 2025-09-10 13:53:35 -07:00
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>

Thien Tran | a0933c3bd6 | [Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs (#24577) | 2025-09-10 12:33:41 -07:00
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>

Lucas Wilkinson | 0ae43dbf8c | [Attention] add DCP support for FLASH_ATTN_MLA backend (#24453) | 2025-09-10 17:19:26 +08:00
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>

Yong Hoon Shin | dc625ea6b8 | [Perf] Convert np array to torch tensor to index into block table for attn chunking (#24474) | 2025-09-09 20:01:06 -07:00
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

Wentao Ye | 15de5ff9ea | [Feature] Disallow FlashMLA on Blackwell (#24521) | 2025-09-09 14:59:34 -04:00
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

elvischenv | bba1042c6f | [Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel (#23647) | 2025-09-08 20:53:07 -07:00
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

Matthew Bonanni | 620db1fc58 | [Attention] FlashAttention MLA cudagraph support (#23958) | 2025-09-08 22:05:26 +00:00
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

tomeras91 | e041314184 | [Bugfix] Fix mamba2 prefill chunking (#23279) | 2025-09-08 11:42:41 +00:00
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Ming Yang | 86173ad593 | [Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA (#24385) | 2025-09-08 09:27:12 +08:00
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

youkaichao | 558f0907dc | [attention][DCP] use AttentionImpl.need_to_return_lse_for_decode (#24372) | 2025-09-07 01:18:59 +00:00
Signed-off-by: youkaichao <youkaichao@gmail.com>

yzds | ac201a0eaf | [Feature] Support Decode Context Parallel (DCP) for MLA (#23734) | 2025-09-06 13:24:05 +08:00
Signed-off-by: hongchao <hongchao@msh.team>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: hongchao <hongchao@msh.team>
Co-authored-by: youkaichao <youkaichao@gmail.com>

Didier Durand | 35bf193864 | [Doc]: fix typos in Python comments (#24294) | 2025-09-05 19:41:12 -07:00
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

Didier Durand | 83609ca91d | [Doc]: fix typos in Python comments (#24173) | 2025-09-04 08:52:17 -07:00
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

Kunshang Ji | 16ded21eeb | [XPU] support Triton Attention backend on Intel GPU (#24149) | 2025-09-04 20:41:08 +08:00
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

Lucas Wilkinson | 402759d472 | [Attention] FlashAttn MLA (#14258) | 2025-09-04 02:47:59 -07:00
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>

Matthew Bonanni | a742322092 | [Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend (#23289) | 2025-09-03 14:05:24 -04:00
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Didier Durand | d7e1e59972 | [Doc]: fix typos in Python comments (#24093) | 2025-09-02 21:05:45 -07:00
Signed-off-by: Didier Durand <durand.didier@gmail.com>

co63oc | 1bd007f234 | fix some typos (#24071) | 2025-09-02 20:44:50 -07:00
Signed-off-by: co63oc <co63oc@users.noreply.github.com>

Ning Xie | fb4983e112 | [Misc] add reorder_batch AttentionMetadataBuilder (#23798) | 2025-08-30 06:41:45 -07:00
Signed-off-by: Andy Xie <andy.xning@gmail.com>

Huy Do | 67c14906aa | Update PyTorch to 2.8.0 (#20358) | 2025-08-29 18:57:35 +08:00
Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Yong Hoon Shin | cb293f6a79 | [V1] Enable prefill optimization for Gemma3n (#22628) | 2025-08-28 14:54:30 -07:00
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

Woosuk Kwon | 7ffbf27239 | [BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu (#23737) | 2025-08-28 14:22:46 -07:00
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Divakar Verma | 04d1dd7f4a | [ROCm][Aiter] Add triton fp8 bmm kernel for mla (#23264) | 2025-08-28 18:18:08 +00:00
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>

Hyogeun Oh (오효근) | 4e4d017b6f | [Docs] Fix warnings in mkdocs build (continued) (#23743) | 2025-08-27 17:17:29 +00:00
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>

Woosuk Kwon | 11eddf02f0 | [FlashInfer] Cache hyper params in metadata builder (#23732) | 2025-08-27 03:45:04 -07:00
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Woosuk Kwon | 6578e87365 | Optimize input preparation for FlashInfer [2/N] (#23174) | 2025-08-27 02:52:45 -07:00
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Woosuk Kwon | efc88cf64a | [Misc] Simplify FlashInfer attention metadata (#23585) | 2025-08-25 15:42:29 -07:00
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>

Driss Guessous | e0329ed4b4 | Updates to Flex + VLLm integration (#21416) | 2025-08-25 09:32:42 -04:00
Signed-off-by: drisspg <drisspguessous@gmail.com>

Ayush Satyam | 5c4b6e66fe | [Attention] Unify mamba and attention backend selection (#23171) | 2025-08-25 09:09:36 +00:00
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com>

elvischenv | 24d0c9e6ed | [NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel (#22703) | 2025-08-22 22:09:05 +00:00
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

Russell Bryant | 281710ef9a | [Attention] Allow V1 flash_attn to support cross-attention (#23297) | 2025-08-22 12:10:16 +00:00
Signed-off-by: Russell Bryant <rbryant@redhat.com>

Chen Zhang | 17373dcd93 | [Attention] Refactor AttentionMetadata Preparation for Encoder-only Models (#23154) | 2025-08-22 05:05:59 +00:00
Signed-off-by: Chen Zhang <zhangch99@outlook.com>

Matthew Bonanni | 19fe1a0510 | [Kernel] Add FP8 support with FlashMLA backend (#22668) | 2025-08-22 02:26:32 +00:00
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

Pavani Majety | 1d353b6352 | [Core] Always use tensor cores for Flashinfer Decode Wrapper (#23214) | 2025-08-21 16:02:11 -04:00
Signed-off-by: Pavani Majety <pmajety@nvidia.com>

Paul Pak | 2e2000f352 | [Model] Add LFM2 architecture (#22845) | 2025-08-21 09:35:07 +02:00
Signed-off-by: Paul Pak <paulpak58@gmail.com>

Asaf Joseph Gardin | 3663870c72 | [V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support (#23035) | 2025-08-20 20:08:51 -07:00
Signed-off-by: asafg <asafg@ai21.com>
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
Co-authored-by: asafg <asafg@ai21.com>

Matthew Bonanni | 10cc12ba66 | Feature/mla tests (#23195) | 2025-08-20 21:46:47 +00:00
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Matthew Bonanni | a4fbb32fab | Remove chunked_prefill_enabled flag in V1 MLA (#23183) | 2025-08-20 21:43:17 +00:00
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

Woosuk Kwon | d6d13bd49e | [Misc] Add max_seq_len to CommonAttentionMetadata (#23216) | 2025-08-20 09:05:29 -07:00
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

who who who | d983769c41 | fix cuda graph (#22721) | 2025-08-20 06:24:37 +00:00
Signed-off-by: fsx950223 <fsx950223@outlook.com>

Zebing Lin | a634733f67 | [Attention] Optimize make_local_attention_virtual_batches for Flash Attention (#23185) | 2025-08-20 02:57:47 +00:00
Signed-off-by: linzebing <linzebing1995@gmail.com>

Lucas Wilkinson | 14e2b0730b | [BugFix] fix CUTLASS MLA full cudagraph (#23200) | 2025-08-19 22:17:08 +00:00
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Woosuk Kwon | e61bac87ee | [Misc] Minor refactoring for FlashInfer backend (#23147) | 2025-08-19 13:11:51 -07:00
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Woosuk Kwon | 5b5f350d67 | [Misc] Enable yapf for FlashInfer backend (#23193) | 2025-08-19 10:33:47 -07:00
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>