Commit Graph

  • 4a9952ec1b [Bugfix] Add quant_config in ViT of Kimi-K2.5 (#34501) LoganJane 2026-02-14 00:05:34 +08:00
  • 1dae7b7843 [Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one (#34508) Roger Wang 2026-02-13 05:59:00 -08:00
  • 5885e330ef [Misc] Port Qwen3.5 Configs (#34512) Roger Wang 2026-02-13 05:24:25 -08:00
  • 071d863e20 Extend ColBERT support to non-standard BERT backbones (#34170) Ilya Boytsov 2026-02-13 10:53:09 +01:00
  • 0916e7960b [GDN] Use CPU tensors to build GDN metadata (#34498) Woosuk Kwon 2026-02-13 01:24:45 -08:00
  • 3d2a026fd0 [Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement (#33368) Wentao Ye 2026-02-13 03:38:16 -05:00
  • dddbff4624 [Core] Move pause and resume functions into engine (#34125) Aaron Hao 2026-02-13 00:15:10 -08:00
  • 47e9b63e1a [KVConnector] Clean up redundant code in KV connectors (#34147) Martin Hickey 2026-02-13 08:14:30 +00:00
  • 934acddef9 [Perf] fused_moe: add int4_w4a16 benchmark support and tuning config (#34130) Matthias Gehre 2026-02-13 09:14:27 +01:00
  • 742d214d6e [Bugfix] fix the import path in moe test utils.py (#34245) Marek Michalowski 2026-02-13 08:13:45 +00:00
  • 4137c5dfa7 [Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode (#34418) haosdent 2026-02-13 16:13:22 +08:00
  • 7a8a46ddcb [BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec (#34440) Harry Huang 2026-02-13 16:13:14 +08:00
  • bcf0731aa0 [New Model] support new model ovis2.6 (#34426) myselvess 2026-02-13 16:12:45 +08:00
  • ec090c2429 [Refactor] Call renderer for online IO processor request (#34490) Cyrus Leung 2026-02-13 14:48:45 +08:00
  • eea3024f43 [Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 (#34489) Roger Wang 2026-02-12 22:48:42 -08:00
  • 2f308214c0 [Refactor] Pass full VllmConfig to Renderer (#34485) Cyrus Leung 2026-02-13 14:48:38 +08:00
  • 1b4e8e53f8 [CI/Build] Fix CUDA re-initialization error in distributed model tests (#34491) Cyrus Leung 2026-02-13 14:43:53 +08:00
  • dcf6ee8592 [Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image (#34483) haosdent 2026-02-13 13:04:06 +08:00
  • 372b2e762a [Bugfix] Standardize getting number of image patches/tokens (#34358) Cyrus Leung 2026-02-13 12:47:01 +08:00
  • 6afa587d31 [ROCm][CI] Fix serving tokens test failures (#34047) Andreas Karatzas 2026-02-12 21:27:53 -06:00
  • 94ed6cf6ea Add new sections to CODEOWNERS (#34309) Cyrus Leung 2026-02-13 10:39:28 +08:00
  • bf37812ca7 [Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode (#33706) Harry Huang 2026-02-13 10:21:52 +08:00
  • b86bf4417e [Bugfix] Fix Random Dataset Prefix Length Inaccuracy (#33907) Frank Wang 2026-02-12 18:21:19 -08:00
  • de13dd781f [Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure (#34025) Yanan Cao 2026-02-12 18:21:05 -08:00
  • 62788f99a4 [Bugfix] Delete unused redundant code in Kimi-K2.5 (#34427) LoganJane 2026-02-13 10:18:42 +08:00
  • ea5ff3a1f6 [Refactor] Simplify BOS/EOS token handling (#34435) Cyrus Leung 2026-02-13 10:18:24 +08:00
  • 04ea31baab [Bugfix] Remove assert that's no longer valid (#34443) bnellnm 2026-02-12 21:18:15 -05:00
  • 6f019e6e0a [BugFix] Add block_size validation for mamba cache align mode (#34445) Harry Huang 2026-02-13 10:18:07 +08:00
  • d707678dfb Fix num_logprobs parameter description in sampler.py (#34451) Zhuohan Li 2026-02-12 18:18:03 -08:00
  • fc22cae4ac [CI/Build] Update video URLs for testing (#34446) Cyrus Leung 2026-02-13 10:15:36 +08:00
  • 96161fe978 [Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel (#33373) Yanan Cao 2026-02-12 18:13:12 -08:00
  • 4453ba8d9e [Core] Profiler improvements and lazy initialization (#33198) Jaewon 2026-02-12 16:16:38 -08:00
  • aa181c923b [Core] Add sleep level 0 mode with enqueue/wait pattern (#33195) Jaewon 2026-02-12 16:16:25 -08:00
  • be7370daf3 [Frontend] Enable generic structured_outputs for responses API (#33709) Alec S 2026-02-12 19:15:48 -05:00
  • 9ea1f598ce Use paged_attention_v1 for sliding window decode in rocm_aiter_fa (#34378) Mengtao (Martin) Yuan 2026-02-12 16:14:43 -08:00
  • f120bd42d3 [Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 (#33506) amitz-nv 2026-02-12 23:06:58 +02:00
  • fac4e96940 small adjustment to wvSplitKrc (#34410) Hashem Hashemi 2026-02-12 12:26:36 -08:00
  • 6d4e27ce29 [Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA (#34374) Michael Goin 2026-02-12 15:08:06 -05:00
  • 4c078fa546 [ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility (#34447) Andreas Karatzas 2026-02-12 12:47:34 -06:00
  • 6c0baee610 [Voxtral Realtime] Refactor & Improve buffering logic (#34428) Patrick von Platen 2026-02-12 18:46:43 +01:00
  • 1100a97621 [Voxstral Realtime] Enable tests (#33803) Patrick von Platen 2026-02-12 18:43:24 +01:00
  • 766e167821 [ROCm][quantization] improve OCP weight quant parser robust (#34431) xuebwang-amd 2026-02-13 01:40:19 +08:00
  • becbe24808 [Bugfix] Remove broken raw url GGUF model loading support (#34433) Isotr0py 2026-02-13 01:40:01 +08:00
  • 679ca5d8d3 Fix MoE for the Transformers modelling backend (#34436) Harry Mellor 2026-02-12 18:29:42 +01:00
  • f2c47886fd [Attention] Add FlashInfer Sparse MLA backend (#33451) Matthew Bonanni 2026-02-12 12:21:54 -05:00
  • 334c715e0f [Docs] Spec decoding docs warning removal (#34439) Nicolò Lucchesi 2026-02-12 18:01:51 +01:00
  • 7b5a8b4a9d [BUG] Reset running requests when clearing cache for pause/resume (#34382) Aaron Hao 2026-02-12 08:19:13 -08:00
  • dea63512bb Add config file for fused MoE for Nemotron (TP4, B200) (#34411) danisereb 2026-02-12 16:09:55 +02:00
  • 8a798be929 [ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter (#34192) Douglas Lehr 2026-02-12 07:06:33 -06:00
  • fb455ed547 [V0 Deprecation] Remove code related to per-request logits processors (#34400) Cyrus Leung 2026-02-12 20:44:28 +08:00
  • 2d5be1dd5c release script khluu 2026-02-12 02:37:52 -08:00
  • f5897613fb Fix Mistral config remap to accept compressed-tensors quantization #34028 (#34104) baonudesifeizhai 2026-02-12 03:22:06 -05:00
  • 55a1a9563a Vllm CPU benchmark suite improvement (#34128) Louie Tsai 2026-02-12 00:04:44 -08:00
  • 386bfe5d08 [bugfix] refactor FunASR's _get_data_parser (#34397) AllenDou 2026-02-12 15:26:49 +08:00
  • e9cd691132 [Bugfix] Fix Sparse24 Compressed Tensors models (#33446) Kyle Sayers 2026-02-12 02:15:16 -05:00
  • 80f2ba6ea6 Fix DeepSeek-OCR tensor validation for all size variants (#34085) Yichuan Wang 2026-02-11 22:50:23 -08:00
  • 136b0bfa59 [BugFix] Fix DP chunking (#34379) Lucas Wilkinson 2026-02-11 23:44:03 -07:00
  • 7a06e5b05b [Bugfix] Fix MTP accuracy for GLM-5 (#34385) (tag: v0.16.0rc3) Michael Goin 2026-02-11 22:08:19 -05:00
  • 946b2f106c [Bugfix] send None sentinel on final commit so server properly sends transcription.done (#33963) Junseo Park 2026-02-12 06:01:53 +09:00
  • 5e8adb0c49 [Misc] Bump fastsafetensors version for latest fixes (#34273) Nick Hill 2026-02-11 00:30:09 -08:00
  • 9be1ff2d3a [Bugfix] fix default is_neox_style is True for deepseek (#34353) Xinyu Dong 2026-02-12 02:20:45 +08:00
  • b3ee90f961 [Model] GLM adaptation (#34124) Jee Jee Li 2026-02-09 17:32:52 +08:00
  • b96f7314b4 [Refactor] Pass Renderer to Input Processor (#34329) Cyrus Leung 2026-02-12 11:38:11 +08:00
  • ced2a92f40 [Refactor] Move validation to params definitions (#34362) Cyrus Leung 2026-02-12 11:33:15 +08:00
  • e1d97c38f8 [Bug Fix] Fix naive_block_assignment always defaulting to False due to arg misalignment (#33848) Runkai Tao 2026-02-11 22:30:57 -05:00
  • ec12d39d44 [Bugfix] Fix MTP accuracy for GLM-5 (#34385) Michael Goin 2026-02-11 22:08:19 -05:00
  • ff1f83b056 [Refactor] Replace activation: str with MoEActivation enum (#33843) Michael Goin 2026-02-11 20:29:32 -05:00
  • 83b47f67b1 [ci] Integrate AMD tests into CI (#33626) Kevin H. Luu 2026-02-11 16:54:17 -08:00
  • fb7b30c716 [ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI (#34384) Micah Williamson 2026-02-11 17:52:34 -06:00
  • 31d992d215 [Bugfix] Fix some issues with MoERunner PR #32344 (#34371) bnellnm 2026-02-11 17:33:14 -05:00
  • 5aff2699bd Fix CI failure - Flashinfer Kernel tests (#34316) Wei Zhao 2026-02-11 17:17:16 -05:00
  • 527ca32197 [Bugfix] Fix more multimodal tests for transformers V5 (#34334) Raushan Turganbay 2026-02-11 22:02:05 +01:00
  • 5458eb835d [Bugfix] send None sentinel on final commit so server properly sends transcription.done (#33963) Junseo Park 2026-02-12 06:01:53 +09:00
  • 144d9b7cc8 [Benchmarks] Reduce ready checker log verbosity (#34349) Tomas Ruiz 2026-02-11 21:57:57 +01:00
  • 83e26c834e [GPT-OSS] Remove unnecessary contiguous (#34337) elvischenv 2026-02-12 04:29:29 +08:00
  • 5001211369 [ROCm] [CI] fix test_unrecognized_env (#34350) TJian 2026-02-12 02:50:44 +08:00
  • 11c7ace340 [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) (#34243) Eldar Kurtić 2026-02-11 19:24:22 +01:00
  • be7f3d5d20 [Bugfix] fix default is_neox_style is True for deepseek (#34353) Xinyu Dong 2026-02-12 02:20:45 +08:00
  • 0ab06100f4 [Multimodal] Expose mm_processor_kwargs for DummyInputsBuilder (#34330) Isotr0py 2026-02-12 01:37:40 +08:00
  • ffb3d553cc [Model Runner V2] Init cuda graph pool when necessary (#33217) Xinyu Chen 2026-02-12 01:12:13 +08:00
  • fa7e0bfacf [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… (#32458) junuxyz 2026-02-12 02:03:48 +09:00
  • 48134a2c22 [Docs] Fix typo ("defult") and double spacing (#34348) SorenDreano 2026-02-11 18:02:27 +01:00
  • 64f570ab56 [ROCm] [aiter] Split KV cache update for AiterFlashAttention (#33681) kliuae 2026-02-12 00:26:44 +08:00
  • fd618871b4 [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache (#33948) Rohan Potdar 2026-02-11 10:12:05 -06:00
  • 67a42b5a44 Don't try and run GLM-ASR with remote code (#34352) Harry Mellor 2026-02-11 17:09:40 +01:00
  • c7914d30f9 Reapply [Attention][FA3] Update FA3 to include new swizzle optimization (#34043) Lucas Wilkinson 2026-02-11 08:07:56 -07:00
  • 1b8756562e Responses harmony system message structured (#34268) Adam Binford 2026-02-11 08:14:28 -05:00
  • 275e0d2a99 [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE (#33715) Linda 2026-02-11 13:38:11 +01:00
  • 0f5e55e7a8 Make JAIS compatible with Transformers v5 (#34264) Harry Mellor 2026-02-11 13:30:37 +01:00
  • 1e9204bff3 Make Qwen3VL compatible with Transformers v5 (#34262) Harry Mellor 2026-02-11 13:13:23 +01:00
  • 05339a7b20 [Bugfix][CPU] Fix llama4 inference on CPU (#34321) Li, Jiang 2026-02-11 19:07:23 +08:00
  • 40b8f55358 [Docs] Reduce time spent generating API docs (#34255) Harry Mellor 2026-02-11 11:56:02 +01:00
  • c44d0c6d66 Patch protobuf for CVE-2026-0994 (#34253) (tag: v0.16.0rc2) Seiji Eicher 2026-02-11 02:25:04 -08:00
  • 83db96d8cd [XPU][9/N] clean up existing ipex code/doc (#34111) Kunshang Ji 2026-02-11 16:27:15 +08:00
  • dbfb79fe45 [XPU][7/N] enable xpu fp8 moe (#34202) zofia 2026-02-11 11:33:59 +08:00
  • b2e1fc3589 [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183) Roger Wang 2026-02-09 21:03:32 -08:00
  • 55a1baebc5 [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available (#34153) Gregory Shtrasberg 2026-02-09 17:38:54 -06:00
  • e1e9841631 [torch.compile][Fusion] Fix attention fusion pass removing kv_udpate op. (#33945) Charlie Fu 2026-02-09 15:15:43 -06:00
  • 5bd63387c3 [XPU][6/N] add xpu scaled_mm kernel (#34117) zofia 2026-02-09 20:17:35 +08:00
  • 5045d5c983 Patch protobuf for CVE-2026-0994 (#34253) Seiji Eicher 2026-02-11 02:25:04 -08:00