Commit Graph

  • 5d18bf8b32 [Bugfix] Fix Harmony preamble visibility in Responses API (#32114) pushkar 2026-02-25 21:38:16 +05:30
  • 0788ff0a15 [Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support (#35085) haosdent 2026-02-25 23:31:45 +08:00
  • d72b0be33c [XPU]Fix for Qwen-OMNI crash (#35249) Chendi.Xue 2026-02-25 09:31:07 -06:00
  • 42489e43c2 [Misc][LoRA] Increase max vocab size limit to 258048 in logits processor (#34773) Bhoomit 2026-02-25 07:30:55 -08:00
  • af5e6afa0a [Bugfix] Fix step3p5 reasoning with interleaved thinking (#34211) Mario Hong 2026-02-25 23:13:01 +08:00
  • ee59a7c615 [Tests] Add GSM8k check to SpecDec E2E tests (#34772) Benjamin Chislett 2026-02-25 07:51:14 -05:00
  • 709eadbb0b Doc link typo (#35281) Joao Gante 2026-02-25 11:00:31 +00:00
  • 90fc7f9109 Fix custom processors that use deleted behaviour for Transformers v5 (#35107) Harry Mellor 2026-02-25 10:36:21 +00:00
  • 675ec59aa9 [Bugfix][CPU] Fix basic unit tests failing in CPU platforms (#34677) Yanwen Lin 2026-02-25 00:36:15 -08:00
  • 80e60a6133 [Doc] Suggest "--managed-python" flag when installing python using uv (#33069) Yanwen Lin 2026-02-25 00:19:43 -08:00
  • 26e722f906 [DOC][BugFix] Specfiy build dependency installation (#34513) jonoillar 2026-02-25 09:04:06 +01:00
  • 2c619e5e3f [Docs]Fix documentation formatting in architecture overview (#34679) lichuang 2026-02-25 16:00:15 +08:00
  • 8a685be8d9 docs: document committer proposal process in governance (#35225) Simon Mo 2026-02-24 23:58:48 -08:00
  • 2465071510 [Perf] Add opt-in SM100 Oink RMSNorm custom-op path (#31828) Laura Wang 2026-02-24 23:01:53 -08:00
  • cd43673668 [Perf] Optimize FP8 gemm of sm120. (#34424) wenshuai 2026-02-25 14:25:24 +08:00
  • 35d44b4557 [XPU]Support CUDAGraph on XPU Platform (#34482) Xinyu Chen 2026-02-25 14:22:52 +08:00
  • 8ad54a991b [Platform] Add current_platform.num_compute_units interface (#35042) Kunshang Ji 2026-02-25 14:22:49 +08:00
  • 92510edc32 remove cuda check in top_k_top_p_triton kernel (#35011) Kunshang Ji 2026-02-25 14:22:31 +08:00
  • a6c137521c [Misc] Add shard_id validation for MergedColumnLinear (#35055) Isotr0py 2026-02-25 14:12:28 +08:00
  • 4572a06afe [Misc] Enable weights loading tracking for quantized models (#35074) Isotr0py 2026-02-25 14:11:03 +08:00
  • 5cc29cfb8b [compile] Improve error message during artifacts load failure. (#35115) Zhengxu Chen 2026-02-25 01:01:09 -05:00
  • 8fae54faff [Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache (#35157) Chen Zhang 2026-02-24 22:00:19 -08:00
  • f7967577f5 Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM (#35203) Harry Mellor 2026-02-25 06:00:06 +00:00
  • af770b8e7b [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest (#35237) pks 2026-02-25 07:00:03 +01:00
  • 2ff3e436ad [Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors (#35231) Andreas Karatzas 2026-02-24 23:52:44 -06:00
  • c2c4c4611a [FIX] fused moe with lora shared expert dual stream (1.07x otps) (#34933) Jhao-Ting Chen 2026-02-24 20:40:45 -08:00
  • f38f8c9742 [ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (#35180) Rohan Potdar 2026-02-24 22:36:40 -06:00
  • 89a77b1084 [ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility (#34447) v0.16.0 Andreas Karatzas 2026-02-12 12:47:34 -06:00
  • d3c1513f5f [ci] Use the right tag for CPU arm64 image (#34915) Kevin H. Luu 2026-02-19 19:59:15 -08:00
  • 5dbfbc967b [CI/Build] Fix gRPC version mismatch (#35013) Cyrus Leung 2026-02-22 03:14:41 +08:00
  • c86cdcbcd2 Revert "[Release 2.10] Update to Torch 2.10 - final release (#30525)" khluu 2026-02-24 20:28:53 -08:00
  • 3c9496f146 Revert "[Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available (#34153)" khluu 2026-02-24 20:28:45 -08:00
  • ec1d30c0f6 [Responses] Decouple SSE event helpers from Harmony context (#35148) Flora Feng 2026-02-24 23:05:25 -05:00
  • e3b2324ec4 [Frontend] Use init_app_state and FrontendArgs in run_batch (#32967) Pooya Davoodi 2026-02-24 19:40:39 -08:00
  • dbf0da817a [Core] Cleanup engine pause/sleep logic (#34528) Nick Hill 2026-02-24 19:33:34 -08:00
  • 3bbb2046ff [Bugfix] Fix expert_ids padding values in moe_align_block_size kernel (#35161) Xin Yang 2026-02-24 17:14:24 -08:00
  • 576fe50333 Adding Nemotron fp8 Triton MoE Config (#34674) yugong333 2026-02-24 15:56:38 -08:00
  • a0e50a4260 Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. (#34100) Hashem Hashemi 2026-02-24 15:35:21 -08:00
  • 9fa5b25a23 [Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention (#35075) Benjamin Chislett 2026-02-24 17:55:22 -05:00
  • ea97750414 [CI] Fix Distributed Tests (#35236) Robert Shaw 2026-02-24 17:31:56 -05:00
  • 067c5d9ad1 [ROCm][CI] Added MI325 mirrors (#34923) Andreas Karatzas 2026-02-24 15:37:15 -06:00
  • f5972a872f [Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support (#33726) Benjamin Chislett 2026-02-24 12:49:56 -05:00
  • a9e15e040d Add @MatthewBonanni to CODEOWNERS (#35207) Matthew Bonanni 2026-02-24 12:45:10 -05:00
  • 542ca66357 Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" (#35211) Lucas Wilkinson 2026-02-24 12:26:42 -05:00
  • fc8456c336 [CI/Build] Fix kernels test location (#35205) Cyrus Leung 2026-02-25 01:20:34 +08:00
  • 9ce8fad2a9 [Perf] Optimize Python Slice for Structured Output using islice instead of [:] (#33593) Wentao Ye 2026-02-24 12:02:36 -05:00
  • c38b8d5a31 Remove padding_index from models that don't use it for better Transformers v5 compatibility (#35189) Harry Mellor 2026-02-24 16:04:46 +00:00
  • 60da0e1544 [CI] Remove Duplicated Tests (#35199) Robert Shaw 2026-02-24 10:53:30 -05:00
  • 9609b1f18d Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 (#35053) danisereb 2026-02-24 17:45:13 +02:00
  • a0c7081695 Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe (#35088) danisereb 2026-02-24 17:25:44 +02:00
  • 34ce0ffd1f [CPU][Perf] Accelerate Attention head for s390x using vector intrinsics (#34434) R3hankhan 2026-02-24 20:55:39 +05:30
  • 0de5333989 Fix GLM4 parser tests (#34905) Robin Nabel 2026-02-24 14:27:42 +00:00
  • a87cc50859 [Attn,KV-cache] Use per-head scales in the attention selector (#34281) Eldar Kurtić 2026-02-24 15:02:43 +01:00
  • 761e63e541 [Frontend] Always pass supported_tasks to validation (#35186) Cyrus Leung 2026-02-24 20:16:33 +08:00
  • d12d201409 [Bugfix] Fix failing FunASR processor test (#35111) Isotr0py 2026-02-24 20:13:45 +08:00
  • b3ad37c5db [glm-asr] change defaults dummy audio size (#35108) eustlb 2026-02-24 13:13:33 +01:00
  • 14561fabfd [Perf] Optimize pooling model redundant copy, 1.8% throughput improvement (#35127) Wentao Ye 2026-02-24 07:13:11 -05:00
  • c77f3e1207 [compile] Save aot compile artifacts atomically. (#35117) Zhengxu Chen 2026-02-24 07:11:01 -05:00
  • 012dee9233 [Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) (#35147) Dor Huri 2026-02-24 14:10:32 +02:00
  • f1c664545b Make voxtral compile friendly (#33959) Tugsbayasgalan Manlaibaatar 2026-02-24 16:33:35 +08:00
  • c870eb9e0f [LoRA] Update LoRA expand kernel block_n calculation (#32621) Xin Yang 2026-02-23 23:17:53 -08:00
  • 6af03f2394 [Refactor] [1/N] Reorganize kernel abstraction directory (#34055) BadrBasowid 2026-02-24 14:47:22 +08:00
  • 1a6cf39dec [CI/Build] Remove redundant OpenTelemetry pip install from CI configs (#35032) Vlad Tiberiu Mihailescu 2026-02-24 00:24:11 -06:00
  • f91808ae0d [MM] Allow audio chunking for offline LLM (#34628) Nicolò Lucchesi 2026-02-24 06:04:28 +01:00
  • 33a0d43c71 [BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable (#35156) Vadim Gimpelson 2026-02-24 07:42:24 +04:00
  • 80d93fd6da gpu_model_runner: Cache is_encoder_decoder from model config (#35099) pschlan-amd 2026-02-24 04:08:34 +01:00
  • ec85340531 [Quantization] Support FP8 MoE bias for models like GPT-OSS (#34906) Jia Guo 2026-02-23 19:07:47 -08:00
  • 2ff4e51152 [ROCm] AITER fused RoPE+KVCache (#33443) Rohan Potdar 2026-02-23 21:06:00 -06:00
  • 95642441d0 [Mamba1] - Change supports_update_block_table to True (#35054) Asaf Gardin 2026-02-24 05:05:57 +02:00
  • a7c9f7b7ec [Bugfix] Fix lora_ids in FusedMoE LoRA test (#35135) Xin Yang 2026-02-23 18:49:25 -08:00
  • a4bd661fb3 [Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default (#34924) Michael Goin 2026-02-23 20:34:41 -05:00
  • 3ef9fd0f98 [Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches (#35123) Michael Goin 2026-02-23 20:11:27 -05:00
  • 22a97e6613 [Perf] Improve default triton fused moe configs (#34846) Michael Goin 2026-02-23 19:01:28 -05:00
  • 596ed1f02e [RL] Validation for pause_mode='keep' (#34992) Aaron Hao 2026-02-23 13:30:56 -08:00
  • b8d8b7e934 [Misc] Monitor interface changes (#35113) Nicolò Lucchesi 2026-02-23 18:14:51 +01:00
  • 28c5e69ba0 Enforce that model is the first positional arg when --served-model-name is used (#34973) Harry Mellor 2026-02-23 16:38:05 +00:00
  • 864167d376 Fix custom processors that use deleted import for Transformers v5 (#35101) Harry Mellor 2026-02-23 16:38:00 +00:00
  • a2ba6a5244 [Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) (#34874) haosdent 2026-02-24 00:31:51 +08:00
  • c4f38696f7 Use Xet high performance mode for Transformers v5 (#35098) Harry Mellor 2026-02-23 16:19:30 +00:00
  • a7f341c323 [Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling (#35080) haosdent 2026-02-24 00:05:52 +08:00
  • d13ece38d7 [CI] Skip Responses API (#34990) Robert Shaw 2026-02-23 10:46:45 -05:00
  • 5cc7c4452e [Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) (#30950) Mark McLoughlin 2026-02-23 15:01:07 +00:00
  • b95bb6927f [kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies (#34254) Eldar Kurtić 2026-02-23 15:37:55 +01:00
  • 392645454b [Refactor] Decouple TimingContext from InputProcessingContext (#35083) Cyrus Leung 2026-02-23 22:15:50 +08:00
  • 1e8438a89a [Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests (#35033) Eldar Kurtić 2026-02-23 15:04:34 +01:00
  • 8435b2e049 [ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) (#34302) Robert Shaw 2026-02-23 09:02:26 -05:00
  • b1b5e045df [XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend (#35010) Yan Ma 2026-02-23 21:06:44 +08:00
  • 5f68464f92 [ROCm][CI] Fix spec decode profile assertion and logprob test determinism (#35043) Andreas Karatzas 2026-02-23 07:05:54 -06:00
  • aa08a30fc9 [CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig (#35060) Vincent Gimenes 2026-02-23 14:05:36 +01:00
  • 7f40e9e516 [Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item (#35068) Wentao Ye 2026-02-23 08:05:20 -05:00
  • 103e614b14 Fix pipeline parallel with embed scaling in the Transformers modelling backend (#35094) Harry Mellor 2026-02-23 13:04:47 +00:00
  • 54e2f83d0a [Feature] Lazy import for the "mistral" tokenizer module. (#34651) Neil Schemenauer 2026-02-23 00:43:01 -08:00
  • e631f8e78e fix: Apply embedding_multiplier to inputs_embeds (#34813) Gabe Goodhart 2026-02-23 01:42:46 -07:00
  • e97c46a92d [BugFix]: Fix local mypy issues (#34739) Martin Hickey 2026-02-23 08:40:29 +00:00
  • 7291d1b288 [Bugfix] Fix kernel benchmark (#33752) Jee Jee Li 2026-02-23 13:18:08 +08:00
  • 987506bca6 [Refactor] Simplify dummy data generation (#35025) Cyrus Leung 2026-02-23 12:55:27 +08:00
  • c645e9a214 [Model Runner V2] Remove propose_draft method (#35070) Woosuk Kwon 2026-02-22 18:27:12 -08:00
  • 944ffb5968 [Model Runner V2][Minor] Remove redundant do_spec_decode field (#35039) Nick Hill 2026-02-22 16:18:04 -08:00
  • 2bcf71b9c0 [Spec Decode] Reduce TP communication for speculative decoding draft token generation (#34049) qizixi 2026-02-22 14:59:16 -08:00
  • b7892a3bef [Model] Add NVFP4 quantization support for Step3.5-Flash (#34478) tacos8me 2026-02-22 14:30:46 -05:00