Commit Graph

  • a676e668ee [Bugfix] fix apply_temperature to avoid nan in probs (#24734) courage17340 2025-09-25 13:32:21 +08:00
  • c85be1f6dd optimize: eliminate duplicate split_enc_dec_inputs calls (#25573) Nicole LiHui 🥜 2025-09-25 13:03:25 +08:00
  • 845adb3ec6 [Model] Add LongCat-Flash (#23991) XuruiYang 2025-09-25 12:53:40 +08:00
  • 90b139cfff Enable Fbgemm NVFP4 on Dense models (#25609) Saman A. Pour 2025-09-24 21:12:53 -07:00
  • 4492e3a554 [Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super() (#25613) Wentao Ye 2025-09-24 21:52:52 -04:00
  • 05c19485a5 [Kernel] Support DCP for Triton backend (#25132) Wei Wei 2025-09-24 18:09:34 -07:00
  • 52d0cb8458 [Model] Improve DotsOCRForCausalLM (#25466) Jee Jee Li 2025-09-25 07:58:08 +08:00
  • 5c1e496a75 [MISC] replace c10::optional with std::optional (#25602) Shiyan Deng 2025-09-24 16:56:21 -07:00
  • e7f27ea648 Improve --help for enhanced user experience (#24903) Harry Mellor 2025-09-25 00:08:18 +01:00
  • 1f29141258 [Refactor] Use DeepGEMM Col Major TMA Aligned Tensor (#25517) Wentao Ye 2025-09-24 18:52:36 -04:00
  • 6160ba4151 feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expert Parallel (#25503) Duncan Moss 2025-09-24 15:50:04 -07:00
  • fea8006062 [Logging] Improve log for when DeepEP HT disables CUDA Graphs (#25531) Tyler Michael Smith 2025-09-24 18:43:06 -04:00
  • e6750d0b18 [V0 Deprecation] Remove unused classes in attention (#25541) Woosuk Kwon 2025-09-24 13:24:40 -07:00
  • 8c853050e7 [Docs] Enable fail_on_warning for the docs build in CI (#25580) Harry Mellor 2025-09-24 20:30:33 +01:00
  • f84a472a03 Suppress benign cuBLAS warning when capturing cudagraphs with DBO (#25596) Sage Moore 2025-09-24 12:02:08 -07:00
  • 54e42b72db Support mnnvl all2allv from Flashinfer (#21003) Shu Wang 2025-09-24 13:38:16 -05:00
  • 2dda3e35d0 [Bugfix] add cache model when from object storage get model (#24764) rongfu.leng 2025-09-25 02:11:16 +08:00
  • d83f3f7cb3 Fixes and updates to bench_per_token_quant_fp8 (#25591) Michael Goin 2025-09-24 11:30:15 -04:00
  • 302eb941f3 [ROCm][Build][Bugfix] Fix ROCm base docker whls installation order (#25415) Gregory Shtrasberg 2025-09-24 11:25:10 -04:00
  • 487745ff49 [ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly disabled (#25275) Gregory Shtrasberg 2025-09-24 11:24:39 -04:00
  • 9313be5017 [Misc] Improve type annotations for jsontree (#25577) Cyrus Leung 2025-09-24 22:49:58 +08:00
  • 8938774c79 Move DeviceConfig, ObservabilityConfig, SpeechToTextConfig to their own files (#25564) Harry Mellor 2025-09-24 14:59:05 +01:00
  • e18b714b2e [Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools in non-streaming output (#25405) Tao Hui 2025-09-24 20:58:00 +08:00
  • b1068903fd [docs] fix nixl kv_connector_extra_config.backends key (#25565) Peter Pan 2025-09-24 19:00:27 +08:00
  • 164299500b [Benchmark] Fix regression in structured output benchmark (#25500) Russell Bryant 2025-09-24 06:40:42 -04:00
  • 58c360d9be [Bug] fix import and unit test (#25558) Jonas M. Kübler 2025-09-24 12:17:59 +02:00
  • 42488dae69 [Bugfix] Fix dummy video number of frames calculation (#25553) Roger Wang 2025-09-24 02:47:30 -07:00
  • b67dece2d8 [misc] update the warning message (#25566) youkaichao 2025-09-24 17:24:35 +08:00
  • 2338daffd3 [BugFix] Potential Fix for FA3 full-cudagraph IMA (#25490) Lucas Wilkinson 2025-09-24 05:04:04 -04:00
  • 2e19a848d4 [V0 Deprecation] Remove max_seq_len_to_capture (#25543) Woosuk Kwon 2025-09-24 01:51:39 -07:00
  • 77a7fce1bb [CI/Build] add nightly prime-rl integration tests (#25207) Jackmin801 2025-09-24 01:44:22 -07:00
  • 6488f3481b [Misc] Move processing context to multimodal directory (#25548) Cyrus Leung 2025-09-24 16:15:00 +08:00
  • 27ec3c78f3 [CI/Build] Fix v1 OOT registration test (#25547) Isotr0py 2025-09-24 16:03:13 +08:00
  • 1cbcfb94de [Bugfix][CPU] Skip unsupported custom op register on CPU (#25534) Li, Jiang 2025-09-24 14:21:51 +08:00
  • fed8a9b107 [Misc] Retry HF processing if "Already borrowed" error occurs (#25535) Cyrus Leung 2025-09-24 13:32:11 +08:00
  • 190c45a6af [TPU][Bugfix] fix the missing apply_model in tpu worker (#25526) Chengji Yao 2025-09-23 22:18:08 -07:00
  • 5caaeb714c [Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls (#25514) Ben Browning 2025-09-23 23:20:38 -04:00
  • d747c2ef18 [Perf] Fix jit compiles at runtime of fla gated delta rule (#25432) Corey Lowman 2025-09-23 23:16:13 -04:00
  • c30b405b8f [Spec Decode] Enable FlashInfer Spec Decoding (#25196) Benjamin Chislett 2025-09-23 22:29:58 -04:00
  • 77d906995c [KV sharing] Re-land Gemma3n model changes from #22628 (#24357) Yong Hoon Shin 2025-09-23 19:25:34 -07:00
  • 359d293006 [fix]: add Arm 4bit fused moe support (#23809) Nikhil Gupta 2025-09-24 02:32:22 +01:00
  • 9df8da548e [BugFix] Fix MLA assert with CUTLASS MLA (#25478) Lucas Wilkinson 2025-09-23 21:09:43 -04:00
  • bf68fd76a9 [Compile] Fix AMD Compile Error (#25518) Wentao Ye 2025-09-23 20:42:48 -04:00
  • de94289a98 [Core] Support weight_loader_v2 for UnquantizedLinearMethod (#23036) Kyle Sayers 2025-09-24 01:30:26 +01:00
  • 1983609239 [Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen (#25520) Benjamin Chislett 2025-09-23 20:19:56 -04:00
  • d06b5a95cb [V1][Metrics] Add per-request TPOT histogram (#24015) baxingpiaochong 2025-09-24 08:19:04 +08:00
  • be0bb568c9 [Model] Support SeedOss Reason Parser (#24263) 0xNullPath 2025-09-24 08:15:51 +08:00
  • c8bde93367 [BUG] Allows for RunAI Streamer and Torch.compile cache to be used together (#24922) ahao-anyscale 2025-09-23 17:13:32 -07:00
  • 88d7bdbd23 [Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv' (#25519) Wentao Ye 2025-09-23 20:07:51 -04:00
  • 0d235b874a Add CUTLASS FP8 MOE benchmark scripts and kernel config (#25302) Chenxi Yang 2025-09-23 17:07:42 -07:00
  • 7ad5e50adf Improve output when failing json.loads() on structured output test (#25483) Doug Smith 2025-09-23 20:03:31 -04:00
  • dc464a3d39 [BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch (#25505) Lucas Wilkinson 2025-09-23 20:00:29 -04:00
  • 1210e4d95b [Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 (#25509) Alexander Matveev 2025-09-23 19:57:55 -04:00
  • e0b24ea030 [Perf] Increase default max splits for FA3 full cudagraphs (#25495) Lucas Wilkinson 2025-09-23 19:53:34 -04:00
  • bde2a1a8a4 [ROCm] Small functional changes for gptoss (#25201) Juan Villamizar 2025-09-23 18:39:50 -05:00
  • 5e25b12236 [Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for _chunk_cumsum_fwd_kernel (#25197) Thomas Parnell 2025-09-24 01:23:30 +02:00
  • c85d75cf08 Add VLLM_NVTX_SCOPES_FOR_PROFILING=1 to enable nvtx.annotate scopes (#25501) Corey Lowman 2025-09-23 18:50:09 -04:00
  • abad204be6 [BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting (#25359) kourosh hakhamaneshi 2025-09-23 15:49:09 -07:00
  • 7361ab379f Remove redundant mutates_args and dispatch_key for direct_register_custom_op (#25512) Michael Goin 2025-09-23 18:48:40 -04:00
  • 95bc60e4cb [gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI (#25428) Andrew Xia 2025-09-23 15:46:46 -07:00
  • 4f2954f724 Fix triton_reshape_and_cache_flash.py triton import (#25522) Michael Goin 2025-09-23 18:26:10 -04:00
  • eca7be9077 Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA… (#25493) rouchenzi 2025-09-23 15:17:49 -07:00
  • 969b4da3a6 [V0 Deprecation] Remove placeholder attn (#25510) Thomas Parnell 2025-09-24 00:12:14 +02:00
  • 4f8c4b890a [Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] (#24830) Jialin Ouyang 2025-09-23 15:11:14 -07:00
  • ae002924e9 [CI/Build] Fix and re-enable v1 PP test on CI (#25496) Isotr0py 2025-09-24 05:58:25 +08:00
  • 690f948e4a [Bugfix] Fix for the import error from #24588 (#25481) Gregory Shtrasberg 2025-09-23 17:31:08 -04:00
  • 08275ec0a2 [Build] Update Xgrammar to 0.1.25 (#25467) Chauncey 2025-09-24 05:25:46 +08:00
  • c828d1bf98 [Bugfix] gpt-oss container tool output bug (#25485) Alec S 2025-09-23 16:43:45 -04:00
  • 8b8a8afc89 [CI] Fix Pre-commit Issue (#25497) Wentao Ye 2025-09-23 16:09:37 -04:00
  • 8bdd8b5c51 Enable symmetric memory all reduce by default only enabling for TP (#25070) Ilya Markov 2025-09-23 21:53:00 +02:00
  • a8ffc4f0f2 [Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible with FA3 (#25508) Michael Goin 2025-09-23 15:49:55 -04:00
  • d5944d5146 [Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue (#25406) jiahanc 2025-09-23 12:44:35 -07:00
  • 24fab45d96 [Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE (#25444) Michael Goin 2025-09-23 15:29:26 -04:00
  • 63400259d0 [Performance] Move apply_w8a8_block_fp8_linear to an op class (#24666) ElizaWszola 2025-09-23 21:03:10 +02:00
  • 8c1c81a3de [core] add nccl symmetric memory for all reduce (#24532) Amir Samani 2025-09-23 11:33:06 -07:00
  • a3a7828010 [ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 (#24988) Hashem Hashemi 2025-09-23 11:31:45 -07:00
  • 5abb117901 [Core] Ensure LoRA linear respect the base_layer's tp_size and tp_rank (#25487) Jee Jee Li 2025-09-24 02:19:25 +08:00
  • 867ecdd1c8 [Spec Decode][CI] Add e2e test for examples/spec_decode.py and prevent breaking Acceptance Length (#24531) Ekagra Ranjan 2025-09-23 13:46:40 -04:00
  • 24e8222745 [Misc] Reduce initialization time of auto_tune (#23682) Weida Hong 2025-09-24 01:34:58 +08:00
  • 100b630a60 [V1][Kernel] Add triton implementation for reshape_and_cache_flash (#24503) Burkhard Ringlein 2025-09-23 18:52:40 +02:00
  • 527821d191 Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu (#25346) Ming Yang 2025-09-23 09:45:39 -07:00
  • 846197f505 [Log] Optimize kv cache memory log from Bytes to GiB (#25204) Wentao Ye 2025-09-23 12:44:37 -04:00
  • 2357480b1a [BugFix] Fix UB in per_token_group_quant.cu (#24913) rivos-shreeasish 2025-09-23 09:14:22 -07:00
  • f11e3c516b [Kernels] Support blocked fp8 quantization for compressed tensors MoE (#25219) bnellnm 2025-09-23 12:11:34 -04:00
  • 875d6def90 Add backward compatibility for GuidedDecodingParams (#25422) Harry Mellor 2025-09-23 17:07:30 +01:00
  • cc1dc7ed6d [Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support (#24845) Lucas Wilkinson 2025-09-23 12:02:10 -04:00
  • a903669e10 [V1] Remove V0 code paths for Hybrid models (#25400) Thomas Parnell 2025-09-23 17:26:13 +02:00
  • 2c58742dff [UX] Change kv-cache-memory log level to debug (#25479) Michael Goin 2025-09-23 11:01:24 -04:00
  • 4c966e440e [XPU] Fix MOE DP accuracy issue on XPU (#25465) Fanli Lin 2025-09-23 22:32:57 +08:00
  • da5e7e4329 [Docs] NixlConnector quickstart guide (#24249) Peter Pan 2025-09-23 22:23:22 +08:00
  • f05a4f0e34 [P/D] Support NIXL connector to disconnect during a clean shutdown (#24423) Chauncey 2025-09-23 22:08:02 +08:00
  • 61d1b35561 [BugFix] Register expert_map as named buffer for wake_up and sleep (#25458) Joel 2025-09-23 21:49:13 +08:00
  • b6a136b58c [CI/Build] Fix disabled v1 attention backend selection test (#25471) Isotr0py 2025-09-23 21:05:46 +08:00
  • 0d9fe260dd [docs] Benchmark Serving Incorrect Arg (#25474) vllmellm 2025-09-23 21:05:11 +08:00
  • 273690a50a [Core] Optimize LoRA weight loading (#25403) Jee Jee Li 2025-09-23 18:19:45 +08:00
  • 231c2c63e4 [Bugfix] Fix idefics3 tie_word_embeddings (#25454) Isotr0py 2025-09-23 18:06:48 +08:00
  • 4322c553a6 [Test]: Hermes tool parser stream output error in Qwen3 case (#25203) Andreas Hartel 2025-09-23 11:56:31 +02:00
  • babad6e5dd [Misc] Move DP for ViT code inside model executor dir (#25459) Cyrus Leung 2025-09-23 17:20:52 +08:00
  • 9383cd6f10 [Frontend] Add a new xml-based tool parser for qwen3-coder (#25028) Zhikaiiii 2025-09-23 16:07:27 +08:00
  • ba8d2165b6 Handle triton kernel import exception (#25319) Ming Yang 2025-09-23 00:56:00 -07:00