Commit Graph

4602 Commits

Author SHA1 Message Date
Aaron Hao
2ce6f3cf67 [Feat][RL][2/2] Native Weight Syncing API: IPC (#34171)
Signed-off-by: hao-aaron <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
2026-02-27 13:45:21 -07:00
Lucas Wilkinson
1d532f9d8f [DP] Only use DP padding when cudagraphs are actually used (#34102)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-02-27 15:14:31 -05:00
Zhengxu Chen
29b35477b0 [compile] Fix caching error over pytree slice node. (#35308)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2026-02-27 19:34:16 +00:00
Huamin Li
157722da75 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block (#35480)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2026-02-28 01:50:37 +08:00
fort726
905d76b51d [Model] Add huggingface skt/A.X-K1 model (#32407)
Signed-off-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com>
Signed-off-by: fort726 <38447663+fort726@users.noreply.github.com>
Co-authored-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2026-02-27 09:26:02 -08:00
Yanan Cao
9098ce690c [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching (#34390)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2026-02-27 09:21:35 -08:00
Yueqian Lin
e8249378e4 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests (#35487)
Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-02-27 06:48:25 -08:00
Max Hu
9c3fe9936b Flashinfer cuDNN backend for Qwen3 VL ViT attention (#34580)
Signed-off-by: Max Hu <maxhu@nvidia.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: Max Hu <maxhu@nvidia.com>
Co-authored-by: Shang Wang <shangw@nvidia.com>
2026-02-27 20:20:23 +08:00
Umut Polat
b66a74649e [Bugfix] Replace assert with ValueError for response_format validation in completions endpoint (#35456)
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com>
2026-02-27 08:01:06 +00:00
gnovack
a532c83849 use 'max_active_experts' for moe lora input size (#33197)
Signed-off-by: gnovack <gnovack@amazon.com>
2026-02-27 03:50:43 +00:00
Nicolò Lucchesi
cabdaa7619 [Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils (#35400)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-02-27 11:42:51 +08:00
daniel-salib
d43048ce05 [Bugfix] Emit reasoning_part events in simple streaming path for Resp… (#35184)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
2026-02-27 09:49:06 +08:00
Michael Goin
4fec53cfcb [CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI (#34274)
2026-02-26 17:58:03 -07:00
Andrii Skliar
56a6371706 [Update] Use FlashInfer fast_decode_plan directly instead of replication (#34687)
Signed-off-by: Andrii <askliar@nvidia.com>
Co-authored-by: Andrii <askliar@nvidia.com>
2026-02-26 16:31:43 -08:00
Tyler Michael Smith
eb19955c37 [WideEP] Remove pplx all2all backend (#33724)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 14:30:10 -08:00
Lucia Fang
0f2f24c8b2 [Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism (#35429)
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2026-02-26 22:08:16 +00:00
不做了睡大觉
967572dd5f fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning (#35230)
Signed-off-by: stakeswky <stakeswky@users.noreply.github.com>
Co-authored-by: stakeswky <stakeswky@users.noreply.github.com>
2026-02-26 20:30:45 +00:00
Lucas Wilkinson
5e58bdc711 [Bugfix] Remove erroneous lower bound on LoRA vocab size constraint (#35354)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-02-26 18:44:50 +00:00
Yiliu Dong
d940607629 [Core] Support min_tokens with speculative decoding (#32642)
Signed-off-by: qianlihuang <yiliu.dong@qq.com>
Co-authored-by: qianlihuang <yiliu.dong@qq.com>
2026-02-26 12:31:28 -05:00
Wentao Ye
99c7892c5b [Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement (#35330)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-02-26 17:14:54 +00:00
Jakub Zakrzewski
111d869069 [Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model (#35297)
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
2026-02-26 14:17:17 +00:00
Cyrus Leung
845ee348ef [Misc] Standardize handling of mm_processor_kwargs.size (#35284)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-26 13:05:46 +00:00
Yueqian Lin
c0615a296d [Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression (#35368)
Signed-off-by: linyueqian <linyueqian@outlook.com>
2026-02-26 11:58:23 +00:00
Jiangyun Zhu
ab87f85231 [Model] Ring 2.5 (#35102)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2026-02-26 02:17:11 -08:00
Krish Gupta
3827c8c55a [Test] Add tests for n parameter in chat completions API (#35283)
Signed-off-by: KrxGu <krishom70@gmail.com>
2026-02-26 09:14:07 +00:00
Chaojun Zhang
9f9a675b23 [XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA (#34115)
Signed-off-by: chzhang <chaojun.zhang@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-02-26 15:46:44 +08:00
Cyrus Leung
d3a51da92a [Benchmark] Simplify SLA scan (#35306)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-25 22:35:41 -08:00
Flora Feng
186ea22efe [Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py (#35339)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-02-26 14:35:16 +08:00
Jason Li
9d37941017 [torch.compile] Sequence Parallelism threshold compile ranges (#28672)
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com>
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-02-26 05:00:12 +00:00
Hanjie Qiu
71dfce6aa6 [Kernel] Refactor FlashInfer allreduce for mnnvl backend (#34109)
Signed-off-by: hjjq <50634613+hjjq@users.noreply.github.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
2026-02-26 03:17:20 +00:00
Seungmin Kim
160424a937 [Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs (#33992)
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com>
Signed-off-by: Andrew Mello <19512127+88plug@users.noreply.github.com>
Co-authored-by: 88plug <19512127+88plug@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-02-25 18:15:51 -08:00
Michael Goin
de527e1cec [UX] Add --moe-backend arg for explicit kernel selection (#33807)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-02-25 17:44:44 -08:00
Yongye Zhu
1976356ee6 [MoE Refactor] MXFP4 Cutlass Experts to MK (#34542)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2026-02-25 17:32:39 -08:00
Ming Yang
6831650c40 [offloader] v2: Hide weight onloading latency via prefetching (#29941)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-02-25 17:20:59 -08:00
Andreas Karatzas
9571e99945 [ROCm][CI] Extending attention backend coverage for Eagle spec decode tests (#35265)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-02-25 14:16:18 -08:00
Elizabeth Thomas
c97234c08b fix(mxfp4): Disable monolithic path for TRITON backend with EP (#34270)
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-02-25 13:33:42 -08:00
rasmith
b188bab441 [CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor (#34985)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2026-02-25 19:18:00 +00:00
Andreas Karatzas
8fd6975479 [ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results (#35049)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-02-25 16:48:37 +00:00
pushkar
5d18bf8b32 [Bugfix] Fix Harmony preamble visibility in Responses API (#32114)
Signed-off-by: Pushkar Patel <git@thepushkarp.com>
Signed-off-by: pupa <pupa@users.noreply.github.com>
2026-02-25 08:08:16 -08:00
Bhoomit
42489e43c2 [Misc][LoRA] Increase max vocab size limit to 258048 in logits processor (#34773)
Signed-off-by: Bhoomit Vasani <vbhoomit@amazon.com>
2026-02-25 23:30:55 +08:00
Mario Hong
af5e6afa0a [Bugfix] Fix step3p5 reasoning with interleaved thinking (#34211)
Signed-off-by: mariohong <mariohong128@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-02-25 15:13:01 +00:00
Benjamin Chislett
ee59a7c615 [Tests] Add GSM8k check to SpecDec E2E tests (#34772)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2026-02-25 07:51:14 -05:00
Yanwen Lin
675ec59aa9 [Bugfix][CPU] Fix basic unit tests failing in CPU platforms (#34677)
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-02-25 08:36:15 +00:00
Laura Wang
2465071510 [Perf] Add opt-in SM100 Oink RMSNorm custom-op path (#31828)
Signed-off-by: Laura Wang <3700467+Laurawly@users.noreply.github.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2026-02-24 23:01:53 -08:00
Kunshang Ji
8ad54a991b [Platform] Add current_platform.num_compute_units interface (#35042)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
2026-02-24 22:22:49 -08:00
Chen Zhang
8fae54faff [Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache (#35157)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2026-02-24 22:00:19 -08:00
Harry Mellor
f7967577f5 Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM (#35203)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-02-24 22:00:06 -08:00
Andreas Karatzas
2ff3e436ad [Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors (#35231)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-02-25 05:52:44 +00:00
Rohan Potdar
f38f8c9742 [ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (#35180)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
2026-02-25 04:36:40 +00:00
Flora Feng
ec1d30c0f6 [Responses] Decouple SSE event helpers from Harmony context (#35148)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-02-24 20:05:25 -08:00