Max Hu
|
9c3fe9936b
|
Flashinfer cuDNN backend for Qwen3 VL ViT attention (#34580)
Signed-off-by: Max Hu <maxhu@nvidia.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: Max Hu <maxhu@nvidia.com>
Co-authored-by: Shang Wang <shangw@nvidia.com>
|
2026-02-27 20:20:23 +08:00 |
|
Umut Polat
|
b66a74649e
|
[Bugfix] Replace assert with ValueError for response_format validation in completions endpoint (#35456)
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com>
|
2026-02-27 08:01:06 +00:00 |
|
Wang Xingran
|
07bdabef03
|
[Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter (#33088)
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com>
Signed-off-by: Hongjian Zhang <hirokenovo@gmail.com>
Co-authored-by: Hongjian Zhang <hirokenovo@gmail.com>
|
2026-02-27 07:06:08 +00:00 |
|
Chengyi Nie
|
a572baff5e
|
[Model Performance] Add Qwen3MoE tuned MoE configs for H200 (#35457)
Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
|
2026-02-27 13:51:14 +08:00 |
|
zofia
|
516cf26698
|
[Bug] correct out dtype of rms_norm_gated native path (#35369)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-02-27 05:19:51 +00:00 |
|
Jiangyun Zhu
|
487e5c51f7
|
[Bugfix] disable allreduce_rms_fusion by default when pp size > 1 (#35424)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
|
2026-02-27 04:18:52 +00:00 |
|
Daniel Huang
|
1a8c71674e
|
[BugFix] Repo utils debug print patch (#35434)
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
|
2026-02-27 03:50:56 +00:00 |
|
Wentao Ye
|
062b789632
|
[Bug] Fix outdated links in source code (#35314)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-02-27 03:50:46 +00:00 |
|
gnovack
|
a532c83849
|
use 'max_active_experts' for moe lora input size (#33197)
Signed-off-by: gnovack <gnovack@amazon.com>
|
2026-02-27 03:50:43 +00:00 |
|
Jee Jee Li
|
1e5ad9b74f
|
[Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping (#35413)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2026-02-26 19:46:30 -08:00 |
|
Nicolò Lucchesi
|
cabdaa7619
|
[Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils (#35400)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2026-02-27 11:42:51 +08:00 |
|
Chenyaaang
|
06be53563b
|
[Core] Extract is_last_rank in Ray for tpu to override (#33012)
Signed-off-by: Chenyaaang <chenyangli@google.com>
|
2026-02-27 03:18:52 +00:00 |
|
Angela Yi
|
c29ee9c326
|
[compile] Invalidate cache for cpu flags (#35119)
Signed-off-by: angelayi <yiangela7@gmail.com>
|
2026-02-27 02:54:11 +00:00 |
|
daniel-salib
|
d43048ce05
|
[Bugfix] Emit reasoning_part events in simple streaming path for Resp… (#35184)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
|
2026-02-27 09:49:06 +08:00 |
|
Michael Goin
|
4fec53cfcb
|
[CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI (#34274)
|
2026-02-26 17:58:03 -07:00 |
|
roikoren755
|
38c498b8e3
|
[Performance] Cublas Bf16 Gate with Fp32 Output (#35121)
Signed-off-by: Roi Koren <roik@nvidia.com>
|
2026-02-26 16:51:28 -08:00 |
|
Andrii Skliar
|
56a6371706
|
[Update] Use FlashInfer fast_decode_plan directly instead of replication (#34687)
Signed-off-by: Andrii <askliar@nvidia.com>
Co-authored-by: Andrii <askliar@nvidia.com>
|
2026-02-26 16:31:43 -08:00 |
|
Pavani Majety
|
6283021142
|
[Bugfix] Fix KV Scale loading for MLA Models (#35430)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
|
2026-02-26 23:38:19 +00:00 |
|
Aleksandr Malyshev
|
01923eec70
|
[ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales (#30357)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
|
2026-02-26 16:50:16 -06:00 |
|
pkousha
|
31fb6f43da
|
[Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds (#33839)
Signed-off-by: pkousha <43781676+pkousha@users.noreply.github.com>
Co-authored-by: Pouya Kousha <pkousha@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-02-26 14:35:58 -08:00 |
|
Tyler Michael Smith
|
eb19955c37
|
[WideEP] Remove pplx all2all backend (#33724)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-02-26 14:30:10 -08:00 |
|
Lucia Fang
|
0f2f24c8b2
|
[Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism (#35429)
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
|
2026-02-26 22:08:16 +00:00 |
|
sychen52
|
d0105b84f0
|
add mixed precision support for modelopt (#35047)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
|
2026-02-26 21:56:24 +00:00 |
|
danielafrimi
|
832a780f3a
|
Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models (#35396)
Signed-off-by: dafrimi <dafrimi@nvidia.com>
|
2026-02-26 16:55:19 -05:00 |
|
ElizaWszola
|
98217b09f9
|
[Performance] Extract KV cache update op from flashinfer forward (#35422)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
|
2026-02-26 21:29:01 +00:00 |
|
不做了睡大觉
|
967572dd5f
|
fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning (#35230)
Signed-off-by: stakeswky <stakeswky@users.noreply.github.com>
Co-authored-by: stakeswky <stakeswky@users.noreply.github.com>
|
2026-02-26 20:30:45 +00:00 |
|
Woosuk Kwon
|
3d66502e1b
|
[Model Runner V2] Prepare attn metadata in ModelState [2/N] (#35383)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-02-26 11:47:02 -08:00 |
|
Woosuk Kwon
|
c66aa48e99
|
[Model Runner V2] Add model states [1/N] (#35350)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-02-26 11:20:35 -08:00 |
|
Nick Hill
|
b6d5a17298
|
[Model Runner V2] Fix error-handling (#35063)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
2026-02-26 11:00:19 -08:00 |
|
Lucas Wilkinson
|
5e58bdc711
|
[Bugfix] Remove erroneous lower bound on LoRA vocab size constraint (#35354)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-02-26 18:44:50 +00:00 |
|
Runkai Tao
|
a1f53addb1
|
[BugFix] Align fused MoE-LoRA kernel config with actual weight shapes (#34396)
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
|
2026-02-26 18:03:10 +00:00 |
|
Wentao Ye
|
05970c772c
|
[Refactor] Remove dead code for attention benchmark script (#35418)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-02-26 09:53:46 -08:00 |
|
Yiliu Dong
|
d940607629
|
[Core] Support min_tokens with speculative decoding (#32642)
Signed-off-by: qianlihuang <yiliu.dong@qq.com>
Co-authored-by: qianlihuang <yiliu.dong@qq.com>
|
2026-02-26 12:31:28 -05:00 |
|
Wentao Ye
|
99c7892c5b
|
[Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement (#35330)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-02-26 17:14:54 +00:00 |
|
hujia177
|
ec8f943db1
|
Add GlmOcrConfig for GLM-OCR model type recognition (#34982)
|
2026-02-26 17:04:42 +00:00 |
|
Or Ozeri
|
f2ad952f40
|
[BugFix][kv_offload]: Fix kernel block size detection (#35125)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
|
2026-02-26 16:29:34 +00:00 |
|
Sage Moore
|
9e2cabdf9c
|
[ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release (#34387)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
|
2026-02-26 16:28:45 +00:00 |
|
Douglas Lehr
|
ec8ab9d254
|
[ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers (#34157)
Signed-off-by: Doug Lehr <douglehr@amd.com>
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
Co-authored-by: Doug Lehr <douglehr@amd.com>
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
|
2026-02-26 10:00:49 -06:00 |
|
Wentao Ye
|
05972ea7e5
|
[Refactor] Remove dead or duplicate func utils or variables (#35318)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-02-26 10:57:56 -05:00 |
|
Jakub Zakrzewski
|
111d869069
|
[Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model (#35297)
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
|
2026-02-26 14:17:17 +00:00 |
|
stingoChen
|
7fea7250a4
|
[Bug] Fix missing <think> tag after tool call in MiniMax 2.1 (#35352)
Signed-off-by: 冬马 <chenxinke@cai-inc.com>
Co-authored-by: 冬马 <chenxinke@cai-inc.com>
|
2026-02-26 22:11:07 +08:00 |
|
Cyrus Leung
|
845ee348ef
|
[Misc] Standardize handling of mm_processor_kwargs.size (#35284)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2026-02-26 13:05:46 +00:00 |
|
Asaf Gardin
|
ec13e549d3
|
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic (#35275)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
|
2026-02-26 12:22:06 +00:00 |
|
Li-Yongwen
|
c6ca51598a
|
[Bugfix] fix device_name for routing replay (#34336)
Signed-off-by: liyongwen <1310439159@qq.com>
|
2026-02-26 12:18:38 +00:00 |
|
Yueqian Lin
|
c0615a296d
|
[Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression (#35368)
Signed-off-by: linyueqian <linyueqian@outlook.com>
|
2026-02-26 11:58:23 +00:00 |
|
Harry Mellor
|
01914445b0
|
Remove bc-lint (#35274)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-02-26 03:01:01 -08:00 |
|
Kunshang Ji
|
5281713e11
|
[XPU] use fixed UMD version in dockerfile.xpu (#35392)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-02-26 18:54:55 +08:00 |
|
HZY
|
32693db8ce
|
[Bugfix] [Qwen3.5] Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading (#35289)
Signed-off-by: daowu.hzy <daowu.hzy@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
|
2026-02-26 18:26:15 +08:00 |
|
Akash kaothalkar
|
e03ddcfbd4
|
[Hardware][Powerpc] Enable prefix caching and chunked prefill for ppc64le (#35081)
Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
|
2026-02-26 10:21:24 +00:00 |
|
Sophie du Couédic
|
02acd16861
|
[Benchmarks] Plot benchmark timeline and requests statistics (#35220)
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2026-02-26 02:17:43 -08:00 |
|