Commit Graph

  • e09546cf05 [Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer (#34217) Nick Hill 2026-02-11 02:03:24 -08:00
  • 786806dd44 [Doc] Update Marlin support matrix for Turing (#34319) Tianqi Ren 2026-02-11 17:03:41 +08:00
  • 79504027ef [Misc] Bump fastsafetensors version for latest fixes (#34273) Nick Hill 2026-02-11 00:30:09 -08:00
  • addac0e653 [torch.compile] Enable AR+rms fusion by default available for -O2 (#34299) Luka Govedič 2026-02-11 03:30:00 -05:00
  • 675a22ed66 [Chore] Move BaseRenderer to base.py (#34308) Cyrus Leung 2026-02-11 16:29:51 +08:00
  • cb9574eb85 [XPU][9/N] clean up existing ipex code/doc (#34111) Kunshang Ji 2026-02-11 16:27:15 +08:00
  • 21dfb842d7 [model] support FunASR model (#33247) AllenDou 2026-02-11 15:37:09 +08:00
  • d1b837f0ae [CPU] Enable FP16 (Half dtype) support for s390x (#34116) R3hankhan 2026-02-11 12:11:42 +05:30
  • 0b20469c62 [Bugfix] Fix weight naming in Qwen3.5 (#34313) Roger Wang 2026-02-10 21:37:14 -08:00
  • d7982daff5 [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides (#34279) Tyler Michael Smith 2026-02-11 00:15:52 -05:00
  • 9b17c57460 [ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast (#34298) Robert Shaw 2026-02-11 00:00:00 -05:00
  • 1b3540e6c6 Threshold fix wvSplitk for occasional CI fails (#34013) Hashem Hashemi 2026-02-10 19:59:14 -08:00
  • 7a048ee65f [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 (#34149) Matthias Gehre 2026-02-11 04:58:56 +01:00
  • c9a1923bb4 [Plugin] Simplify IO Processor Plugin interface (#34236) Cyrus Leung 2026-02-11 11:47:39 +08:00
  • b482f71e9f [XPU][7/N] enable xpu fp8 moe (#34202) zofia 2026-02-11 11:33:59 +08:00
  • 1485396abb [Kernel] Apply 256bit LDG/STG To Activation Kernels (#33022) Дзержи́нский 2026-02-11 11:31:51 +08:00
  • 5ee5c86eeb [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast (#33884) Kebe 2026-02-11 12:31:36 +09:00
  • b5dcb372e4 [Misc] Clean up validation logic in input processor (#34144) Cyrus Leung 2026-02-11 11:29:29 +08:00
  • 066c6da6a0 [WideEP] Fix nvfp4 DeepEP High Throughput All2All backend (#33738) Tyler Michael Smith 2026-02-10 22:15:43 -05:00
  • e30cedd44b [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter (#34093) Richard Zou 2026-02-10 22:15:40 -05:00
  • 3bcd494ef4 [Redo] Add --trust-remote-code to dataset bench args (#34251) Cyrus Leung 2026-02-11 11:10:12 +08:00
  • 0e725a7d22 [Bugfix] Fix Worker.load_model context-manager composition for sleep mode (#34021) tianshu-Michael-yu 2026-02-10 19:07:51 -08:00
  • ba0511fd80 [Misc] Add run one batch script that supports profiling (#32968) Lucas Wilkinson 2026-02-10 19:29:49 -07:00
  • 4a1550d22d [ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline (#34280) Micah Williamson 2026-02-10 19:08:11 -06:00
  • d1481ba783 [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344) bnellnm 2026-02-10 19:51:07 -05:00
  • dc6de33c3d [CI] Add pip caching to cleanup_pr_body workflow (#32979) 7. Sun 2026-02-11 08:45:28 +08:00
  • c4b9e6778f [Misc] Add pre-commit hook to catch boolean ops in with-statements (#34271) Tyler Michael Smith 2026-02-10 18:13:20 -05:00
  • 341eed3d30 [torch.compile] Disable recursive pre_grad_passes (#34092) Richard Zou 2026-02-10 18:02:31 -05:00
  • 6f2f59f2b3 [Misc][Spec Decode] support different load config for draft model (#34022) Zhengkai Zhang 2026-02-10 14:52:43 -08:00
  • bb2fc8b5e7 [BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860) Ilya Markov 2026-02-10 23:34:47 +01:00
  • 67132945bb [Perf] Move eplb rebalance algo to async thread (#30888) Ilya Markov 2026-02-10 23:19:10 +01:00
  • f0ca0671c7 [Feature] Warn about unrecognized environment variables (#33581) Gregory Shtrasberg 2026-02-10 15:45:38 -06:00
  • 578977bb5e [SM100] Resubmit FMHA FP8 prefill for MLA (#31195) Pavani Majety 2026-02-10 13:18:43 -08:00
  • 9615575afc [Bugfix] Fix mamba cache dtype for Qwen3.5 (#34200) Roger Wang 2026-02-10 13:12:31 -08:00
  • 4293c00b84 [Benchmarks] Fix attention benchmark smoke test (#34269) Matthew Bonanni 2026-02-10 16:04:07 -05:00
  • 506ad7d7c1 [Bugfix] Fix weights offloading for sleep mode (#32947) J Seppänen 2026-02-10 22:38:17 +02:00
  • fdd6f2ad58 Convert online APIs to use Renderer (#34084) Reagan Lee 2026-02-10 11:44:31 -08:00
  • 33bcd3dc3b [Misc] Introduce ec_both role EC (encoder cache) connector (#34182) Qi Wang 2026-02-10 10:55:35 -08:00
  • 1f5febb4b8 [UX nit] Fix non-default api_server_count message (#34152) Michael Goin 2026-02-10 13:35:58 -05:00
  • ae871ca923 Minor cleanup for Voxtral (#34247) Andy Lo 2026-02-10 18:18:30 +00:00
  • a2443de5fa [Model Runner V2] Use pinned memory for write_contents (#34222) Woosuk Kwon 2026-02-10 08:55:22 -08:00
  • f84a2a8f31 [Docs] Speed up build environment set-up (#34240) Harry Mellor 2026-02-10 17:34:43 +01:00
  • 000214c4bb [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP (#34077) Vadim Gimpelson 2026-02-10 19:57:11 +04:00
  • c5a66d1697 [Core][BugFix] Fix PP KV cache sharding memory validation (#33698) junuxyz 2026-02-11 00:46:24 +09:00
  • afdce12c89 [Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680) Roberto L. Castro 2026-02-10 16:29:52 +01:00
  • 82e11973cc [compile] Enable AOT compile with 2.10 in trunk. (#34155) Zhengxu Chen 2026-02-10 10:24:42 -05:00
  • b129136c7a [ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations (#29008) xuebwang-amd 2026-02-10 23:08:05 +08:00
  • 599e4335a4 Support benchmarking of Geospatial models (#33922) mgazz 2026-02-10 15:04:16 +00:00
  • a1946570d8 add --insecure arg to the vllm bench to skip TLS (#34026) Fan Yang 2026-02-10 06:23:52 -08:00
  • d0bc520569 Bump mamba-ssm version in CI for Transformers v5 compatibility (#34233) Harry Mellor 2026-02-10 14:46:01 +01:00
  • 748625cdaf [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input (#34220) Krish Gupta 2026-02-10 18:35:32 +05:30
  • 61413973e8 Stop testing for slow tokenizers as they will not exist soon (#34235) Harry Mellor 2026-02-10 13:08:20 +01:00
  • 94de871546 [Misc] allow specify is_mm_prefix_lm in hf_config (#34215) Phúc H. Lê Khắc 2026-02-10 18:16:21 +07:00
  • e042d7e685 Add flagos in MiniCPM-o (#34126) tc-mb 2026-02-10 18:51:48 +08:00
  • ae4e280602 [Bugfix] Fix FI kernel chunk_gated_delta_rule output shape for Qwen3.5 (#34219) Roger Wang 2026-02-10 02:41:24 -08:00
  • cbea11c9f0 [Docs] Fix format error in KV load failure recovery doc (#34137) zzaebok 2026-02-10 18:16:26 +08:00
  • 2c32558a3c [Bugfix] Fix --trust-remote-code conflict (#34218) Cyrus Leung 2026-02-10 16:29:10 +08:00
  • 5f970120f0 [Bugfix] Fix memory inconsistency in cross-process shared memory (#32022) Zetong Li 2026-02-10 16:22:03 +08:00
  • 998e2d91f8 Revert #34208 (#34216) Cyrus Leung 2026-02-10 15:59:04 +08:00
  • e1060a71a1 [Perf] Optimize detokenizer python logic (#32975) Wentao Ye 2026-02-10 02:54:41 -05:00
  • 97fa8f6590 [BugFix] Avoid prefix cache hit in the same schedule step for mamba layers (#29387) Chen Zhang 2026-02-09 23:41:16 -08:00
  • dab1de9f38 [Frontend][CI] Consolidate instrumentator entrypoints (#34123) wang.yuqi 2026-02-10 15:30:19 +08:00
  • 8d48d0a9d9 [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 (#34190) Balaxxe 2026-02-10 00:06:30 -07:00
  • 9608844f96 [responsesAPI] fix simpleContext streaming output_messages (#34188) Andrew Xia 2026-02-09 22:53:07 -08:00
  • f69b903b4c [Bugfix] Add --trust-remote-code to dataset bench args (#34208) Cyrus Leung 2026-02-10 14:37:50 +08:00
  • 81e217fe6b [Bugfix] Fix DP Attention Padding in Dummy Run (#34187) Lucas Wilkinson 2026-02-09 21:29:39 -08:00
  • ab97bcf662 [CI/Build] Relax test_mcp_tool_call (#34204) Cyrus Leung 2026-02-10 13:18:57 +08:00
  • 25e48a3aae [Doc] Update usage of --limit-mm-per-prompt (#34148) Cyrus Leung 2026-02-10 13:12:13 +08:00
  • 8a5e0e2b2b [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183) Roger Wang 2026-02-09 21:03:32 -08:00
  • 4cde2e0159 [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection (#34108) Andreas Karatzas 2026-02-09 22:50:20 -06:00
  • 047a457fa4 [Bugfix] Adopt ChunkGatedDeltaRule for Qwen3.5 (#34198) Roger Wang 2026-02-09 19:47:54 -08:00
  • e94ec59733 [LMCache] Token Base IPC API (#34175) Yuwei An 2026-02-09 17:18:42 -08:00
  • 13397841ab [structured output] validate unsupported json features first (#33233) Ning Xie 2026-02-10 07:49:09 +08:00
  • c60f8e3b49 [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available (#34153) Gregory Shtrasberg 2026-02-09 17:38:54 -06:00
  • 5e75a14a66 [Doc] Add DCP support to attention backend doc (#33936) Michael Goin 2026-02-09 18:33:43 -05:00
  • e7e52781ff [ModelRunner V2][BugFix] Fix max_query_len calculation (#34167) Nick Hill 2026-02-09 13:47:17 -08:00
  • bb9f97308d [torch.compile][Fusion] Fix attention fusion pass removing kv_update op. (#33945) Charlie Fu 2026-02-09 15:15:43 -06:00
  • 4d39650961 [ROCm] update triton branch to support gpt-oss models for gfx11xx devices (#34032) Hongxia Yang 2026-02-09 14:36:30 -05:00
  • 8fd31f6245 [Bugfix] Voxtral prompt/audio placeholder alignment (#34140) Artus Krohn-Grimberghe 2026-02-09 20:30:38 +01:00
  • eadb4e868b [Bugfix] Avoid duplicate k-proj weight emission in helper (#34142) Artus Krohn-Grimberghe 2026-02-09 20:17:44 +01:00
  • 285bab4752 [Kernel] use flashinfer for gdn prefill (#32846) Jiangyun Zhu 2026-02-10 01:17:25 +08:00
  • 995bbf38f1 [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) (#34087) TomerBN-Nvidia 2026-02-09 18:44:18 +02:00
  • d4f123cc48 [Kernel] FlashInfer: switch allreduce fusion to unified API (#33985) Mohammad Miadh Angkad 2026-02-09 23:43:24 +08:00
  • cb62e86f83 Add NUMA Core binding in nixl_connector for CPU xPyD (#32365) ZhengHongming888 2026-02-09 07:39:12 -08:00
  • 781ddf7868 [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 (#34031) Luka Govedič 2026-02-09 10:05:14 -05:00
  • 64a9c2528b [UX] Add --language-model-only for hybrid models (#34120) Roger Wang 2026-02-09 06:57:33 -08:00
  • d0d97e2974 [Misc] Fix up attention benchmarks (#33810) Lucas Wilkinson 2026-02-09 06:42:03 -08:00
  • 9562912cea [MODEL] Adding Support for Qwen3.5 Models (#34110) JJJYmmm 2026-02-09 21:12:58 +08:00
  • 9bdb06b436 [XPU][6/N] add xpu scaled_mm kernel (#34117) zofia 2026-02-09 20:17:35 +08:00
  • caad9f1e01 [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul (#33901) Nikhil Gupta 2026-02-09 10:04:41 +00:00
  • 1d5922fade [ASR] Fix audio benchmark and add RTFx metric (#32300) Ekagra Ranjan 2026-02-09 05:02:37 -05:00
  • 3025b3cebb [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr (#34107) Andreas Karatzas 2026-02-09 03:37:04 -06:00
  • 978a37c823 [Model] GLM adaptation (#34124) Jee Jee Li 2026-02-09 17:32:52 +08:00
  • 5a5c43511a fix(cpu): fix mla_decode compilation on x86 without AVX512 (#34052) ihb2032 2026-02-09 16:55:41 +08:00
  • d9bede0314 [BugFix] Fix fastsafetensors TP all procs using all GPUs (#34070) Nick Hill 2026-02-08 23:15:46 -08:00
  • 22b64948f6 [Frontend][last/5] Make pooling entrypoints request schema consensus. (#31127) v0.16.0rc1 wang.yuqi 2026-02-09 14:42:38 +08:00
  • 7c233dbb36 [Tiny] Rename encoder budget file to more specific name (#34103) Reagan Lee 2026-02-08 19:48:19 -08:00
  • a75a5b54c7 [bug-fix] supported_tasks is breaking backward compatibility at init_app_state (#34027) kourosh hakhamaneshi 2026-02-08 17:46:46 -08:00
  • f97ca67176 [Release 2.10] Update to Torch 2.10 - final release (#30525) Andrey Talman 2026-02-08 16:51:09 -05:00
  • 084aa19f02 Add support for ModelOpt MXFP8 dense models (#33786) danisereb 2026-02-08 21:16:48 +02:00