Commit Graph

  • e568cf88bc [UX] Infer dtype for local checkpoint (#36218) Isotr0py 2026-03-11 16:50:04 +08:00
  • 098d844731 [NIXL][1/N] Refactor kernel_block_size detection (#35752) Nicolò Lucchesi 2026-03-11 09:11:23 +01:00
  • a40ee486f2 [Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 (#35923) JartX 2026-03-11 08:45:57 +01:00
  • eac2dc2b41 AITER MLA backend: Avoid CPU sync in _build_decode (#35765) pschlan-amd 2026-03-11 08:25:00 +01:00
  • d5080aeaa4 [Refactor] Remove deadcode in Responses API serving (#36726) Flora Feng 2026-03-11 03:11:41 -04:00
  • f22d6e0267 [Hardware][NIXL] set default kv buffer type for different platform (#36438) liuzhenwei 2026-03-11 13:19:28 +08:00
  • 76c6e6da08 [XPU] Support block fp8 moe by fallback to TritonExpert on XPU (#36458) Kunshang Ji 2026-03-11 12:54:09 +08:00
  • 4184653775 feat: add RISC-V support for CPU backend (v2) (#36578) typer-J 2026-03-11 12:51:39 +08:00
  • 4aaaf8c8ce feat(spec_decode): fuse EAGLE step slot mapping and metadata updates (#33503) Sladyn 2026-03-10 21:35:33 -07:00
  • 4bf533623b [Doc] Fix duplicate words in comments (#36713) Hongbin Guo 2026-03-11 12:28:31 +08:00
  • 5f77ef15ae [Misc][Attention] Clean up unused method in CPU_ATTN (#36673) Matthew Bonanni 2026-03-11 00:27:22 -04:00
  • 7d6abdd022 [Fix] Use torch.empty for output in attention+quant fusion (#31785) elvischenv 2026-03-11 12:26:14 +08:00
  • a8ff2cca92 [Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement (#35781) Wentao Ye 2026-03-11 00:25:30 -04:00
  • 42fadebecb [Model] Add support for moonshotai/Kimi-Audio-7B-Instruct (#36127) tunglinwood 2026-03-11 12:24:48 +08:00
  • a197eda9c3 Add tuned H100 MoE configs for LFM2 8B and 24B (#36699) tianshu-Michael-yu 2026-03-10 21:22:02 -07:00
  • 1ff2393897 [ci] Bound nvidia-cudnn-frontend version (#36719) Kevin H. Luu 2026-03-10 21:17:35 -07:00
  • 5bec0b0ba3 [DSV3.2][MTP] Optimize Indexer MTP handling (#36723) Benjamin Chislett 2026-03-11 00:16:56 -04:00
  • 82b110d50e [ci] Bound nvidia-cudnn-frontend version (#36719) Kevin H. Luu 2026-03-10 21:17:35 -07:00
  • 9040cd40af [DSV3.2][MTP] Optimize Indexer MTP handling (#36723) Benjamin Chislett 2026-03-11 00:16:56 -04:00
  • fa0d353acf [Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks (#35194) fangyuchu 2026-03-11 11:22:21 +08:00
  • b386bb3d7c fix bugs when token_classify & classify run concurrently (#36614) Augusto Yao 2026-03-11 11:16:34 +08:00
  • fe714dd507 [openapi server] log exception in exception handler(2/N) (#36201) Ning Xie 2026-03-11 11:16:30 +08:00
  • 8ab3d7427c [Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling (#36691) Matthew Bonanni 2026-03-10 23:01:07 -04:00
  • 6da1310f91 [Bug] Fix TRTLLM Block FP8 MoE Monolithic (#36296) Wei Zhao 2026-03-10 22:04:47 -04:00
  • 84e436ed1c [Bug] Fix TRTLLM Block FP8 MoE Monolithic (#36296) Wei Zhao 2026-03-10 22:04:47 -04:00
  • 81939e7733 [ROCm][CI] Making some tests optional to reduce workload (#36090) Andreas Karatzas 2026-03-10 18:45:27 -05:00
  • 195d1ca3e8 [Minor] Enhance error message for TRTLLM decode uniformity check (#36609) Woosuk Kwon 2026-03-10 15:38:45 -07:00
  • 8d983d7cd6 [Model Runner V2] Add initial CI tests (#36041) Nick Hill 2026-03-10 14:55:21 -07:00
  • 65b2f405dc [Core] Simplify core kv-cache blocks initialization logic (#36521) Nick Hill 2026-03-10 13:20:02 -07:00
  • bc46be5daf Revert "add nemotron v3 reasoning parser (#36393)" khluu 2026-03-10 11:47:09 -07:00
  • 2a68464c5b [Test] test_async_scheduling.py improvements (#36340) Nick Hill 2026-03-10 11:17:26 -07:00
  • 8e39d39fd4 add nemotron v3 reasoning parser (#36393) Shaun Kotek 2026-03-10 00:11:41 +02:00
  • bdd8981dab [compile] Apply stored functorch config while finalizing loaded artifacts. (#36582) Zhengxu Chen 2026-03-10 12:34:35 -04:00
  • f088a831dd [Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata (#36626) Woosuk Kwon 2026-03-10 09:30:56 -07:00
  • 46fa044cc1 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219) Vadim Gimpelson 2026-03-10 14:32:20 +04:00
  • ab43e37158 Fix: Re-Enable EP for trtllm MoE FP8 backend (#36494) amirkl94 2026-03-10 08:11:27 +02:00
  • f45d010120 Fix/resupport nongated fused moe triton (#36412) Shaun Kotek 2026-03-09 20:01:18 +02:00
  • 244b922088 [Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 (#36017) amitz-nv 2026-03-05 00:23:51 +02:00
  • f83b933b84 [CI] Bump mypy version to 1.19.1 (#36104) (tag: v0.17.1rc0) Harry Mellor 2026-03-10 16:18:28 +00:00
  • 82f3f30e26 [ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform (#35719) Pleaplusone 2026-03-11 00:14:35 +08:00
  • 9095cbbfb6 [Bugfix][Sparse MLA] report indexer CG support properly (#36519) Matthew Bonanni 2026-03-10 12:14:31 -04:00
  • 721ae79f50 Improvements to wvSplitKrc skinny GEMM solution (#34304) Hashem Hashemi 2026-03-10 09:14:27 -07:00
  • aefc59f088 FunASR model bugfix (#36633) AllenDou 2026-03-10 23:14:21 +08:00
  • d88f28da05 Fix hf_override_fn when it modifies model_type (#35200) Harry Mellor 2026-03-10 15:03:18 +00:00
  • 106ff69c4e feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency (#35342) Srinivasoo7 2026-03-10 09:43:40 -05:00
  • ca5fb4bbd8 [Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs (#36595) Jiangyun Zhu 2026-03-10 22:39:01 +08:00
  • cf88b23749 fix: check HTTP status in batch read_file to prevent silent failures (#36397) Alvin Tang 2026-03-10 22:22:40 +08:00
  • a3189a08b0 [Model] Consolidate score logic by introduce score_type (#36479) wang.yuqi 2026-03-10 21:32:25 +08:00
  • 409c4e632d [Misc] fix typo: homogenous-> homogeneous (2 lines change) (#36508) SoluMilken 2026-03-10 21:25:37 +08:00
  • 8850738b70 [Bugfix] Fix processor signature (#36630) Raushan Turganbay 2026-03-10 14:20:47 +01:00
  • 234860399b [Frontend][Core] Revert "Add shutdown timeout" (#34730 and #36270) (#36628) Mark McLoughlin 2026-03-10 13:20:41 +00:00
  • c88510083b Fix Qwen2.5-VL test for Transformers v5 (#36532) Harry Mellor 2026-03-10 12:05:34 +00:00
  • 4ff8c3c8f9 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219) Vadim Gimpelson 2026-03-10 14:32:20 +04:00
  • 507ddbe992 feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve (#36169) Chang Su 2026-03-10 03:29:59 -07:00
  • ddbb0d230a [Model Runner V2] Fix mm input embeddings lookup (#36588) Nick Hill 2026-03-10 00:24:58 -07:00
  • 9efc3bdcd6 [Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill (#36580) Nick Hill 2026-03-10 00:23:42 -07:00
  • 156e33553c Fix: Re-Enable EP for trtllm MoE FP8 backend (#36494) amirkl94 2026-03-10 08:11:27 +02:00
  • d0cd736caa [Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. (#36557) hallerite 2026-03-09 22:30:51 -07:00
  • 195c997203 Fix LFM2 MoE test for Transformers v5 (#36534) Harry Mellor 2026-03-10 05:29:17 +00:00
  • 04b67d8f62 Remove unused disable_fallback field (#36546) Zhuohan Li 2026-03-09 20:56:54 -07:00
  • 7279374f91 [Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement (#36159) Wentao Ye 2026-03-09 23:55:58 -04:00
  • 006aea17d7 [BugFix] Remove incorrect assert in split_decodes_and_prefills (#36553) Woosuk Kwon 2026-03-09 20:02:02 -07:00
  • 0836be3b03 [Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support (#31471) Hojin Yang 2026-03-10 11:59:19 +09:00
  • 4e95ec111c [Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 (#36242) Ajay Anubolu 2026-03-09 19:16:26 -07:00
  • 179547d62c [ROCm][CI] Fix ROCm GPT-OSS Eval test group (#36179) Andreas Karatzas 2026-03-09 19:55:20 -05:00
  • f85b4eda3a [bugfix] fix nvlink for nixl/ucx (#36475) youkaichao 2026-03-10 07:49:47 +08:00
  • 2a194ddd72 [Model Runner V2] Add model_state inputs to CUDA graph capture (#36544) Woosuk Kwon 2026-03-09 15:14:51 -07:00
  • 203a7f27da add nemotron v3 reasoning parser (#36393) Shaun Kotek 2026-03-10 00:11:41 +02:00
  • 483463f735 [MRV2] Extensible CG dispatch rework (#35959) Lucas Wilkinson 2026-03-09 16:58:45 -04:00
  • 4e571ce643 [MTP][Misc] Clean up dead code (#36507) Matthew Bonanni 2026-03-09 14:43:06 -04:00
  • 4ff9b045fe [ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm (#36025) Micah Williamson 2026-03-09 13:27:55 -05:00
  • 3fd03f1ec2 [BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder (#36281) Lucas Kabela 2026-03-09 11:22:05 -07:00
  • 10a5f4d53d [Model Runner V2] Use NamedTuple for execute_model_state (#35930) Woosuk Kwon 2026-03-09 11:17:34 -07:00
  • fe0c085c28 [Docs] Remove the reo beacon (#36528) Simon Mo 2026-03-09 11:16:50 -07:00
  • 8d6b3d5dda [Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers (#36436) Taneem Ibrahim 2026-03-09 14:14:11 -04:00
  • 4b87ffbefb [torch.compile] Rename compile_ranges_split_points to compile_ranges_endpoints (#36027) Copilot 2026-03-09 18:04:40 +00:00
  • fa028207aa Fix/resupport nongated fused moe triton (#36412) Shaun Kotek 2026-03-09 20:01:18 +02:00
  • d460a18fc6 [Docs] Expand --allowed-media-domains security guidance with threat details (#36506) Russell Bryant 2026-03-09 13:43:42 -04:00
  • 6e956d9eca [Model Runner V2] Add dummy profile_cudagraph_memory API (#36520) Woosuk Kwon 2026-03-09 10:20:13 -07:00
  • 1e0f917b34 [ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm (#36101) Andreas Karatzas 2026-03-09 12:07:44 -05:00
  • c174d54f86 [ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks (#36292) Andreas Karatzas 2026-03-09 12:02:41 -05:00
  • 55d27cca55 [Misc] fix typo: dependant -> dependent (2 lines change) (#36511) SoluMilken 2026-03-10 01:00:12 +08:00
  • 580864d81e [Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917) Roberto L. Castro 2026-03-09 17:50:36 +01:00
  • 2b28b9b269 [Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 (#35290) Roberto L. Castro 2026-03-09 17:46:57 +01:00
  • 70485a11bd [ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. (#36253) Taoyu Zhu 2026-03-10 00:30:35 +08:00
  • 74a9f54cdb [CI] Fix edge case that could lead to broken docs builds on main (#36515) Harry Mellor 2026-03-09 16:06:19 +00:00
  • 00c4cb5606 [Bugfix] Clear stale CG keys after memory profiling (#36416) Matthew Bonanni 2026-03-09 11:56:00 -04:00
  • 941e52c298 [Refactor] Simplify chat_completion_full_generator for tool parsers (#35634) Wentao Ye 2026-03-09 11:33:46 -04:00
  • be292b7c14 [Bug] Fix pooling model benchmark script (#36300) Wentao Ye 2026-03-09 11:17:45 -04:00
  • 77a73458e3 Reapply [Attention] Refactor check_and_update_config (#35122) Matthew Bonanni 2026-03-09 10:17:14 -04:00
  • 5578f2a4d3 Support online use_audio_in_video (#36319) Tianyu Guo 2026-03-09 22:16:44 +08:00
  • 3ec2115015 [Frontend] Move warmup into Renderer (#36482) Cyrus Leung 2026-03-09 21:03:21 +08:00
  • b0906d8b02 [MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU (#36472) Isotr0py 2026-03-09 18:43:44 +08:00
  • aaf5fa9abf [ci] Bound openai dependency to 2.24.0 (#36471) Kevin H. Luu 2026-03-09 03:43:26 -07:00
  • f96c3ab08c [Deprecation][1/2] Remove items deprecated in v0.18 (#36470) Cyrus Leung 2026-03-09 18:43:23 +08:00
  • dc6b578466 [Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next (#35777) Xin Yang 2026-03-08 23:41:01 -07:00
  • 1bc9c77f6d [XPU] Add test script of PD disaggregation (#36434) liuzhenwei 2026-03-09 13:50:27 +08:00
  • 65a4da1504 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) (#36160) Alex Brooks 2026-03-08 23:46:23 -06:00
  • 217f27598d [Bugfix] Avoid to replace non-tensor members in cpu model runner (#36430) Li, Jiang 2026-03-09 13:06:28 +08:00
  • fff3711a24 [Frontend][2/n] Improve pooling entrypoints | embed. (#36110) wang.yuqi 2026-03-09 11:42:19 +08:00