550 Commits

Author SHA1 Message Date
Robert Shaw
5a93b9162b [MoE Refactor] Integrate Naive Prepare Finalize into MK (#32567)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: amirkl94 <203507526+amirkl94@users.noreply.github.com>
2026-01-27 01:28:02 +00:00
Wentao Ye
8f987883cb [Refactor] Remove unused _moe_permute function (#33108)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-26 16:06:45 -05:00
Roberto L. Castro
fcb9df99bd [Perf][Kernel] Optimize FP4 quantization kernels (SM100F) (#32520)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
2026-01-24 18:45:27 -07:00
Michael Goin
4561f13985 [Refactor] Rename gptq_marlin to marlin to match MoE (#32952)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-23 16:48:12 -05:00
Wentao Ye
dfab5f3764 [Bug] Fix benchmark script moe_permute_unpermute (#32949)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-23 16:18:56 -05:00
Xin Yang
d08b356ee0 [Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper (#32619)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-22 15:47:04 -05:00
Xin Yang
63227accf5 [Kernel] Add topk_sigmoid kernel (#31246)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-21 22:49:51 +00:00
danisereb
f999539869 Add missing import of fused_topk to benchmark_moe (#32784)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2026-01-21 18:30:10 +00:00
whx
1861ae8aae [PluggableLayer][1/N] Define PluggableLayer (Fix ci) (#32744)
Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-01-21 11:38:04 -05:00
Robert Shaw
42135d6898 [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414)
2026-01-21 08:22:33 -05:00
Yuxuan Zhang
71832ba71e [GLM-4.7] GLM Model support for GLM-Lite (#31386)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Yuxuan Zhang <2448370773@qq.com>
2026-01-19 01:18:38 -08:00
Andika Rachman
5e034f2e3d [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend (#32092)
Signed-off-by: andikarachman <andika.rachman.y@gmail.com>
2026-01-12 10:03:28 +00:00
Matthew Bonanni
2612ba9285 [1/N][Attention] Restructure attention: move files (#31916)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-09 13:10:24 -08:00
Nick Hill
29ce48221c [Cleanup] Remove obsolete spec decoding compatibility logic (#32003)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 05:44:18 +00:00
Robert Shaw
9f6dcb71ae [MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
2026-01-08 03:46:27 +00:00
Robert Shaw
5dcd7ef1f2 [MoE Refactor][15/N] Apply Refactor to Fp8 (#31415)
2026-01-07 19:42:33 -05:00
BlankR
0790f07695 [Misc] Improve error messages for unsupported types and parameters (#30593)
Signed-off-by: BlankR <hjyblanche@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-07 09:00:16 +00:00
Cyrus Leung
db318326a5 [Misc] Use deprecated for seed_everything (#31780)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 11:29:55 +00:00
Fadi Arafeh
799b5721f6 [cpu][bench] Add CPU paged attention benchmarks (#31720)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-01-06 10:57:57 +00:00
Michael Goin
e1cd7a5faf [Bugfix] Add init_workspace_manager to moe kernel benchmarks (#31042)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 19:14:33 -08:00
Alfred
a0e9ee83c7 [Benchmark] Fix OOM during MoE kernel tuning for large models (#31604)
Signed-off-by: Alfred <massif0601@gmail.com>
2026-01-02 22:24:51 +00:00
Amir Samani
030fc44914 use the same stream for cuda graph catpure and replay for NCCL (#29207)
Signed-off-by: Amir Samani <asamani@nvidia.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-25 19:10:03 +08:00
Cyrus Leung
8cef137689 [Chore] Update more locations to use attention_config.backend (#31153)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-22 19:19:50 -08:00
Michael Goin
06d490282f [NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size (#30897)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-21 09:41:57 -08:00
Michael Goin
0a1ab1e565 [Perf][Kernels] Vectorize csrc/activations_kernels.cu (#29512)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-16 14:56:02 -08:00
Kevin Musgrave
c01d589813 [Benchmarks] auto_tune.sh: Use hostname variable for server requests (#30529)
Signed-off-by: Kevin Musgrave <kevin.musgrave@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-15 22:00:29 +00:00
Cyrus Leung
64251f48df [Chore] Adjust tokenizer import to avoid circular imports (#30601)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-13 04:42:39 -08:00
Harry Mellor
cf3eacfe58 Standardise get_rope to use rope_parameters["partial_rotary_factor"], not rotary_dim (#30389)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-11 20:45:23 +00:00
Cyrus Leung
d917747c95 [Bugfix] Fix task still being passed in tests/benchmarks (#30476)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-11 10:33:55 +00:00
Ming Yang
fba8906930 [perf] Use direct copy (broadcast) instead of cat for k_nope/k_pe in MLA prefill (#29710)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-12-11 08:20:45 +00:00
Cyrus Leung
7e24e5d4d6 [Deprecation] Remove deprecated task, seed and MM settings (#30397)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-10 19:59:39 -08:00
Matthew Bonanni
794a7875ee [Misc] Consistent case for vllm bench serve results (#30403)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-10 09:44:02 -08:00
Benjamin Chislett
e858bfe051 [Cleanup] Refactor profiling env vars into a CLI config (#29912)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-09 13:29:33 -05:00
ElizaWszola
af0444bf40 [Performance] Fused blockwise quant RMS norm (#27883)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
2025-12-07 16:38:04 +00:00
Jinzhen Lin
879ddb09c3 [Kernel][MoE] optimize moe_align_block_size (#29642)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-07 01:58:47 -08:00
Varun Sundar Rabindranath
19bee6d12d [Performance][DP/EP] Add silu_mul_per_token_group_quant_fp8_colmajor kernel (#29470)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-12-03 18:04:59 +00:00
Lumis Chen
9bcf92295a [Core] Add xxHash as a high-performance hash option for accelerating prefix caching (#29163)
Signed-off-by: LuminolT <lumischen01@gmail.com>
Signed-off-by: Lumis Chen <lumischen01@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-12-03 16:06:57 +00:00
Harry Mellor
951445a52d Remove default values from InitVars so that they're not stored (#29859)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-02 12:16:37 +00:00
Cyrus Leung
653591d5e7 [Chore] Move tokenizer initialization methods (#29793)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-02 13:33:37 +08:00
Liu Jinyi
f5516039c5 [Doc] fix heading levels (#29783)
Signed-off-by: KKKZOZ <kkkzoz@qq.com>
2025-12-01 14:49:22 +00:00
Jinzhen Lin
1656ad3704 [Kernel][Quantization] add w4a8 support for marlin kernel (#24722)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-11-29 07:19:33 -08:00
Cyrus Leung
34a984274e [Misc] Refactor tokenizer interface (#29693)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 04:02:21 -08:00
Didier Durand
eca7a8fb59 [Doc]: fix typos in various files (#29230)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-24 11:10:48 +00:00
Pan Li
e5bfcb6a88 [BugFix][PD]: make example proxy usable with P2pNcclConnector (#26628)
Signed-off-by: PAN <1162953505@qq.com>
2025-11-20 17:38:31 +00:00
Harry Mellor
a8b70304d6 Update rope_scaling to rope_parameters in preparation for Transformers v5 (#28542)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-19 09:06:36 -08:00
Kunshang Ji
2a2d5d2780 Replace torch.cuda.Event with torch.Event for better hardware compatibility (#26985)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-18 11:34:36 -08:00
Ido Segev
49a986ecd4 [Benchmark] multi_turn: Report warmup-inclusive runtime (#28937)
Signed-off-by: Ido Segev <idos@pliops.com>
2025-11-18 16:38:22 +00:00
ai-jz
d231876ce3 [Benchmark] Fix client seed synchronization in multi-turn benchmark (#28512)
Signed-off-by: ai-jz <aijz.xplr@gmail.com>
2025-11-16 15:04:32 +08:00
Bram Wasti
f849ee739c Adding a benchmark for batch invariance (#28161)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-16 13:22:17 +08:00
Di Wu
e63fd44560 Fix: Correctly filter special tokens in benchmark_prefix_caching (#28615)
Signed-off-by: Di Wu <dw2761@nyu.edu>
2025-11-13 10:57:44 +00:00