Commit Graph

  • 9fa6c68fa6 [ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends (#35334) Gregory Shtrasberg 2026-02-27 15:32:55 -06:00
  • 2ce6f3cf67 [Feat][RL][2/2] Native Weight Syncing API: IPC (#34171) Aaron Hao 2026-02-27 12:45:21 -08:00
  • 1f3dbd95fd [Bugfix][Model] Fix gpt-oss batch invariance (#35404) Jakub Zakrzewski 2026-02-27 21:41:24 +01:00
  • 1d532f9d8f [DP] Only use DP padding when cudagraphs are actually used (#34102) Lucas Wilkinson 2026-02-27 15:14:31 -05:00
  • 234a65b781 [Bugfix] Add monkeypatch to prevent race condition from writing (#35420) Lucas Kabela 2026-02-27 11:51:36 -08:00
  • 2decec9856 [Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 (#34888) SteadfastAsArt 2026-02-28 03:39:23 +08:00
  • 29b35477b0 [compile] Fix caching error over pytree slice node. (#35308) Zhengxu Chen 2026-02-27 14:34:16 -05:00
  • b1d9f5372d [Model Runner V2] Warmup kernels (#35172) Nick Hill 2026-02-27 10:43:30 -08:00
  • fd6de37fca [BugFix] Fix 3D rope in transformers backend (#35097) Raushan Turganbay 2026-02-27 19:34:49 +01:00
  • c8aca0c9e1 Support parakeet as audio encoder for nemotron-nano-vl (#35100) Netanel Haber 2026-02-27 20:07:38 +02:00
  • b602e4f299 [Doc] Fix link to Llama chat template for usability (#35525) Martin Hickey 2026-02-27 17:51:09 +00:00
  • 157722da75 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block (#35480) Huamin Li 2026-02-27 09:50:37 -08:00
  • 1d897ff04f [Misc] Fill in some v1 CODEOWNERS gaps (#35524) Nick Hill 2026-02-27 09:34:37 -08:00
  • 905d76b51d [Model] Add huggingface skt/A.X-K1 model (#32407) fort726 2026-02-28 02:26:02 +09:00
  • 9098ce690c [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching (#34390) Yanan Cao 2026-02-27 09:21:35 -08:00
  • 876312f0b5 [Core] Fix gpu_worker.py pre-commit errors (#35312) Nick Hill 2026-02-27 07:54:24 -08:00
  • 5de98abc12 Add @BoyuanFeng to CODEOWNERS (#35317) Boyuan Feng 2026-02-27 07:53:47 -08:00
  • 9251ed5c4f [Bugfix] Handle case when kimi ends reasoning with a tool call (#33646) Koushik Dutta 2026-02-27 06:58:28 -08:00
  • e8249378e4 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests (#35487) Yueqian Lin 2026-02-27 09:48:25 -05:00
  • 6d4f9d3ad5 [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp (#35082) haosdent 2026-02-27 22:27:06 +08:00
  • fbe3f0120a Revert "Add GlmOcrConfig for GLM-OCR model type recognition" (#35512) Harry Mellor 2026-02-27 14:13:27 +00:00
  • 66c1751d13 [compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism (#35410) Jason Li 2026-02-27 05:36:37 -08:00
  • 6467b635b6 [Bugfix] Add missing activation attr to RMSNormGated (#35423) Tib 2026-02-27 13:53:35 +01:00
  • 9c3fe9936b Flashinfer cuDNN backend for Qwen3 VL ViT attention (#34580) Max Hu 2026-02-27 20:20:23 +08:00
  • b66a74649e [Bugfix] Replace assert with ValueError for response_format validation in completions endpoint (#35456) Umut Polat 2026-02-27 11:01:06 +03:00
  • 07bdabef03 [Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter (#33088) Wang Xingran 2026-02-27 15:06:08 +08:00
  • a572baff5e [Model Performance] Add Qwen3MoE tuned MoE configs for H200 (#35457) Chengyi Nie 2026-02-26 21:51:14 -08:00
  • 516cf26698 [Bug] correct out dtype of rms_norm_gated native path (#35369) zofia 2026-02-27 13:19:51 +08:00
  • 487e5c51f7 [Bugfix] disable allreduce_rms_fusion by default when pp size > 1 (#35424) Jiangyun Zhu 2026-02-27 12:18:52 +08:00
  • 1a8c71674e [BugFix] Repo utils debug print patch (#35434) Daniel Huang 2026-02-26 19:50:56 -08:00
  • 062b789632 [Bug] Fix outdated links in source code (#35314) Wentao Ye 2026-02-26 22:50:46 -05:00
  • a532c83849 use 'max_active_experts' for moe lora input size (#33197) gnovack 2026-02-26 19:50:43 -08:00
  • 1e5ad9b74f [Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping (#35413) Jee Jee Li 2026-02-27 11:46:30 +08:00
  • cabdaa7619 [Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils (#35400) Nicolò Lucchesi 2026-02-27 04:42:51 +01:00
  • 06be53563b [Core]Extract is_last_rank in Ray for tpu to override (#33012) Chenyaaang 2026-02-26 19:18:52 -08:00
  • c29ee9c326 [compile] Invalidate cache for cpu flags (#35119) Angela Yi 2026-02-26 18:54:11 -08:00
  • d43048ce05 [Bugfix] Emit reasoning_part events in simple streaming path for Resp… (#35184) daniel-salib 2026-02-26 17:49:06 -08:00
  • 4fec53cfcb [CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI (#34274) Michael Goin 2026-02-26 19:58:03 -05:00
  • 38c498b8e3 [Performance] Cublas Bf16 Gate with Fp32 Output (#35121) roikoren755 2026-02-27 02:51:28 +02:00
  • 56a6371706 [Update] Use FlashInfer fast_decode_plan directly instead of replication (#34687) Andrii Skliar 2026-02-27 01:31:43 +01:00
  • 6283021142 [Bugfix] Fix KV Scale loading for MLA Models (#35430) Pavani Majety 2026-02-26 15:38:19 -08:00
  • 01923eec70 [ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales (#30357) Aleksandr Malyshev 2026-02-26 14:50:16 -08:00
  • 31fb6f43da [Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds (#33839) pkousha 2026-02-26 14:35:58 -08:00
  • eb19955c37 [WideEP] Remove pplx all2all backend (#33724) Tyler Michael Smith 2026-02-26 17:30:10 -05:00
  • 0f2f24c8b2 [Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism (#35429) Lucia Fang 2026-02-26 14:08:16 -08:00
  • d0105b84f0 add mixed precision support for modelopt (#35047) sychen52 2026-02-26 13:56:24 -08:00
  • 832a780f3a Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models (#35396) danielafrimi 2026-02-26 23:55:19 +02:00
  • 98217b09f9 [Performance] Extract KV cache update op from flashinfer forward (#35422) ElizaWszola 2026-02-26 22:29:01 +01:00
  • 967572dd5f fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning (#35230) 不做了睡大觉 2026-02-27 04:30:45 +08:00
  • 3d66502e1b [Model Runner V2] Prepare attn metadata in ModelState [2/N] (#35383) Woosuk Kwon 2026-02-26 11:47:02 -08:00
  • c66aa48e99 [Model Runner V2] Add model states [1/N] (#35350) Woosuk Kwon 2026-02-26 11:20:35 -08:00
  • b6d5a17298 [Model Runner V2] Fix error-handling (#35063) Nick Hill 2026-02-26 11:00:19 -08:00
  • 5e58bdc711 [Bugfix] Remove erroneous lower bound on LoRA vocab size constraint (#35354) Lucas Wilkinson 2026-02-26 13:44:50 -05:00
  • a1f53addb1 [BugFix] Align fused MoE-LoRA kernel config with actual weight shapes (#34396) Runkai Tao 2026-02-26 13:03:10 -05:00
  • 05970c772c [Refactor] Remove dead code for attention benchmark script (#35418) Wentao Ye 2026-02-26 12:53:46 -05:00
  • d940607629 [Core] Support min_tokens with speculative decoding (#32642) Yiliu Dong 2026-02-27 01:31:28 +08:00
  • 99c7892c5b [Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement (#35330) Wentao Ye 2026-02-26 12:14:54 -05:00
  • ec8f943db1 Add GlmOcrConfig for GLM-OCR model type recognition (#34982) hujia177 2026-02-26 09:04:42 -08:00
  • f2ad952f40 [BugFix][kv_offload]: Fix kernel block size detection (#35125) Or Ozeri 2026-02-26 18:29:34 +02:00
  • 9e2cabdf9c [ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release (#34387) Sage Moore 2026-02-26 08:28:45 -08:00
  • ec8ab9d254 [ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers (#34157) Douglas Lehr 2026-02-26 10:00:49 -06:00
  • 05972ea7e5 [Refactor] Remove dead or duplicate func utils or variables (#35318) Wentao Ye 2026-02-26 10:57:56 -05:00
  • 111d869069 [Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model (#35297) Jakub Zakrzewski 2026-02-26 15:17:17 +01:00
  • 7fea7250a4 [Bug] Fix missing <think> tag after tool call in MiniMax 2.1 (#35352) stingoChen 2026-02-26 22:11:07 +08:00
  • 845ee348ef [Misc] Standardize handling of mm_processor_kwargs.size (#35284) Cyrus Leung 2026-02-26 21:05:46 +08:00
  • ec13e549d3 [Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic (#35275) Asaf Gardin 2026-02-26 14:22:06 +02:00
  • c6ca51598a [Bugfix] fix device_name for routing replay (#34336) Li-Yongwen 2026-02-26 20:18:38 +08:00
  • c0615a296d [Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression (#35368) Yueqian Lin 2026-02-26 06:58:23 -05:00
  • 01914445b0 Remove bc-lint (#35274) Harry Mellor 2026-02-26 11:01:01 +00:00
  • 5281713e11 [XPU] use fixed UMD version in dockerfile.xpu (#35392) Kunshang Ji 2026-02-26 18:54:55 +08:00
  • 32693db8ce [Bugfix] [Qwen3.5]Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading (#35289) HZY 2026-02-26 18:26:15 +08:00
  • e03ddcfbd4 [Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le (#35081) Akash kaothalkar 2026-02-26 15:51:24 +05:30
  • 02acd16861 [Benchmarks] Plot benchmark timeline and requests statistics (#35220) Sophie du Couédic 2026-02-26 11:17:43 +01:00
  • ab87f85231 [Model] Ring 2.5 (#35102) Jiangyun Zhu 2026-02-26 18:17:11 +08:00
  • 3827c8c55a [Test] Add tests for n parameter in chat completions API (#35283) (tag: v0.16.1rc0) Krish Gupta 2026-02-26 14:44:07 +05:30
  • ade81f17fe [Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash (#35250) Kevin McKay 2026-02-26 02:11:07 -06:00
  • 6042e66cd5 [ROCm] Add extra step in config initialization to populate custom ops before compilation config init (#34848) Gregory Shtrasberg 2026-02-26 02:05:40 -06:00
  • 9f9a675b23 [XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA (#34115) Chaojun Zhang 2026-02-26 15:46:44 +08:00
  • a07c4c5939 [BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 (#35298) Ofir Zafrir 2026-02-26 09:15:16 +02:00
  • d3a51da92a [Benchmark] Simplify SLA scan (#35306) Cyrus Leung 2026-02-26 14:35:41 +08:00
  • 186ea22efe [Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py (#35339) Flora Feng 2026-02-26 01:35:16 -05:00
  • 4a9c07a0a2 [BugFix] anthropic/serving_messages: fix tool call arguments streaming (#34887) Daniele 2026-02-26 06:39:48 +01:00
  • 9d37941017 [torch.compile] Sequence Parallelism threshold compile ranges (#28672) Jason Li 2026-02-25 21:00:12 -08:00
  • 4171ff6dd9 [CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes (#34890) Fadi Arafeh 2026-02-26 05:00:10 +00:00
  • 13025e71e8 [Model Runner V2] Add coding style guide (#35325) Woosuk Kwon 2026-02-25 20:42:40 -08:00
  • 71dfce6aa6 [Kernel] Refactor FlashInfer allreduce for mnnvl backend (#34109) Hanjie Qiu 2026-02-25 19:17:20 -08:00
  • 2aa4140402 openpangu-vl support video input (#34134) hujiaxin0 2026-02-26 11:08:09 +08:00
  • 86c3b5a808 [BugFix] Fix fp4 quant kernel on CUDA 12.8 (#35210) Roberto L. Castro 2026-02-26 03:32:50 +01:00
  • 160424a937 [Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs (#33992) Seungmin Kim 2026-02-26 11:15:51 +09:00
  • 9511a3f8ee [Bugfix] Fix AttributeError in SMControlContextManager (#35338) Lucas Wilkinson 2026-02-25 21:01:10 -05:00
  • de527e1cec [UX] Add --moe-backend arg for explicit kernel selection (#33807) Michael Goin 2026-02-25 20:44:44 -05:00
  • 1976356ee6 [MoE Refactor] MXFP4 Cutlass Experts to MK (#34542) Yongye Zhu 2026-02-25 17:32:39 -08:00
  • cbf8f7028c [UX] Add --performance-mode {balanced,interactivity,throughput} (#34936) Michael Goin 2026-02-25 20:28:31 -05:00
  • 6831650c40 [offloader] v2: Hide weight onloading latency via prefetching (#29941) Ming Yang 2026-02-25 17:20:59 -08:00
  • ed42507f6d [ROCm][CI] Amending deletion of AMD mirror (#35322) Andreas Karatzas 2026-02-25 16:17:56 -06:00
  • 9571e99945 [ROCm][CI] Extending attention backend coverage for Eagle spec decode tests (#35265) Andreas Karatzas 2026-02-25 16:16:18 -06:00
  • c97234c08b fix(mxfp4): Disable monolithic path for TRITON backend with EP (#34270) Elizabeth Thomas 2026-02-25 15:33:42 -06:00
  • b188bab441 [CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor (#34985) rasmith 2026-02-25 13:18:00 -06:00
  • 15d76f74e2 Revert "[Misc] Enable weights loading tracking for quantized models" (#35309) Lucas Wilkinson 2026-02-25 12:20:15 -05:00
  • 8fd6975479 [ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results (#35049) Andreas Karatzas 2026-02-25 10:48:37 -06:00