Commit Graph

  • 811ac13d03 [Core] Factor out common logic for MM budget calculation (#22228) Cyrus Leung 2025-08-05 14:54:55 +08:00
  • e79a12fc3a [UX] Fail if an invalid attention backend is specified (#22217) Michael Goin 2025-08-05 02:54:52 -04:00
  • cdfd6871a5 [Bugfix] Misaligned params in TreeAttentionImpl (#22226) Cyrus Leung 2025-08-05 13:40:09 +08:00
  • 4b3e4474d7 Optimize configuration access with LRU cache in custom ops (#22204) ZiTian.Zhao 2025-08-05 12:43:24 +08:00
  • bd3db7f469 [Misc] log more detailed message for ensure_model_parallel_initialized (#22144) Ning Xie 2025-08-05 10:36:55 +08:00
  • 29b97c0995 [Doc] add backend to doc string of initialize_model_parallel (#22142) Ning Xie 2025-08-05 10:36:20 +08:00
  • 7b455cf1c0 [Misc] Remove pass_config from CompilationConfig dump_json excluded (#21911) elvischenv 2025-08-05 10:17:18 +08:00
  • 8a6e108e76 fix: kimi_k2 return empty tool call list (#22149) tlipoca9 2025-08-05 10:15:31 +08:00
  • d7b28f3415 [Log] DeepGEMM Update Log for Unaligned Problem Size (#22208) Wentao Ye 2025-08-04 22:13:19 -04:00
  • 6fa41e0c32 self.gate dtype update for GLM-4.5 (#22203) Yuxuan Zhang 2025-08-05 10:12:38 +08:00
  • 031ca762d7 [ROCm][Bugfix] Compilation passes fix (#22202) Gregory Shtrasberg 2025-08-04 22:12:28 -04:00
  • 6ad6b8e115 [FEAT] Refactor ROPE into module (#22192) TJian 2025-08-04 19:12:16 -07:00
  • f4f4e7ef27 [V0 deprecation][P/D] Deprecate v0 KVConnectorBase code (1/2) (#21785) lkchen 2025-08-04 19:11:33 -07:00
  • 5ea71ff46f [V1] reduce block size for tree attention correctness test to fix 'ou… (#22207) Giancarlo Delfin 2025-08-04 19:11:06 -07:00
  • 7175817637 Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling." (#22223) Woosuk Kwon 2025-08-04 18:37:06 -07:00
  • 2dffac464c [Bugfix] V1 Fix the cursor leakage issue during request scheduling. (#21173) PiteXChen 2025-08-05 09:34:10 +08:00
  • bdcb42e45d [NVIDIA] Auto detect modelopt quant and fix DSR1-FP4 weight loading (#22073) Po-Han Huang (NVIDIA) 2025-08-05 09:02:55 +08:00
  • c09efff976 [Bugfix][V1][P/D]Fix the uneven polling issue in the toy proxy for P2pNcclConnector (#21819) Zhonghua Deng 2025-08-05 04:17:05 +08:00
  • 309c1bb822 [Bug] Update auto_tune.sh to separate benchmarking and profiling. (#21629) ericehanley 2025-08-04 10:12:06 -05:00
  • 9af654cc38 [Responses API] Ignore store=True and process the request by default (#22185) Woosuk Kwon 2025-08-04 05:12:48 -07:00
  • a5fff3bd49 Fix Arcee model weight loading: Add custom load_weights (#21725) Raghav Ravishankar 2025-08-04 16:39:56 +05:30
  • 1539ced93a [Doc] Update pooling model docs (#22186) Cyrus Leung 2025-08-04 18:37:06 +08:00
  • 54de71d0df [Sampler] Support returning all logprobs or logits (#21792) 22quinn 2025-08-04 03:04:12 -07:00
  • fed5849d3f [Bugfix] Fix failing GGUF models test (#22174) Isotr0py 2025-08-04 16:27:02 +08:00
  • c1b4eb048a [feat] move WEIGHT_SCALE_SUPPORTED into raise block to accelerate RLHF weight loading (#21164) Weixiao Huang 2025-08-04 15:43:06 +08:00
  • a7b8788d2c [Misc] Modify the organization of GLM series (#22171) Jee Jee Li 2025-08-04 14:51:20 +08:00
  • 8ecb3e9e93 [CI Bugfix] Fix wNa16 kernel not found for test_shared_storage_connector_hashes (#22163) Tyler Michael Smith 2025-08-04 01:19:04 -04:00
  • e5949e5ae0 Remove index_put from MM embeddings merging (#22105) Chenxi Yang 2025-08-03 22:15:14 -07:00
  • 49bcd893e7 [refactor] improve ConstantList exception specificity (#22156) ZiTian.Zhao 2025-08-04 13:14:49 +08:00
  • aa7012eb6d Add tree attention backend for v1 (part 1) (#20401) Giancarlo Delfin 2025-08-03 22:13:26 -07:00
  • c2e75b3c11 remove duplicate code within cleanup_dist_env_and_memory (#22147) Ning Xie 2025-08-04 11:03:58 +08:00
  • 0d7db16a92 [PD] add test for chat completions endpoint (#21925) Abirdcfly 2025-08-04 10:57:03 +08:00
  • 845420ac2c [RLHF] Fix torch.dtype not serializable in example (#22158) 22quinn 2025-08-03 19:43:33 -07:00
  • e27d25a0dc [fix] fix correct assertion syntax error in attention utils. (#22154) ZiTian.Zhao 2025-08-04 10:24:02 +08:00
  • 6f5478298d Use aiohttp connection pool for benchmarking (#21981) Seiji Eicher 2025-08-03 19:23:32 -07:00
  • 6a39ba85fe [Bugfix] Fix failing multimodal standard test (#22153) Isotr0py 2025-08-04 03:04:38 +08:00
  • d3c18c9cb0 fuse fp32 for GLM-4.5 e_score_correction_bias (#22143) Yuxuan Zhang 2025-08-04 00:04:54 +08:00
  • 83f7bbb318 Add chat doc in quick start (#21213) TankNee 2025-08-03 22:47:55 +08:00
  • b5dfb94fa0 [CI/Build][Bugfix] Fix Qwen2.5 tests in CPU CI via fallback silu_and_mul to torch native implementation (#22145) Li, Jiang 2025-08-03 20:34:04 +08:00
  • 6d98843b31 [Responses API] Disable response store by default (#22137) Woosuk Kwon 2025-08-03 04:04:21 -07:00
  • aefeea0fde [V1] [P/D] Refactor KV Connector Path (#21980) David Ben-David 2025-08-03 14:03:40 +03:00
  • 24d1dffbeb [executor] feat: add supports_pp attr to executors (#21786) H 2025-08-03 03:04:45 -07:00
  • 7de45db9a5 [Misc] update doc comment for send (#22026) Ning Xie 2025-08-03 15:55:20 +08:00
  • 789562c28c Support CUTLASS NVFP4 (w4a4) for Blackwell Geforce GPUs (SM120) (#21309) Roberto L. Castro 2025-08-03 09:54:22 +02:00
  • 3f36c325fa [Benchmark] Support ready check timeout in vllm bench serve (#21696) Ye (Charlotte) Qi 2025-08-03 00:52:38 -07:00
  • 3dddbf1f25 [Misc] Add tensor schema test coverage for multimodal models (#21754) Isotr0py 2025-08-03 15:52:14 +08:00
  • 337eb23bcc [Fix] Fix llama4 modelopt weight loading error (#22107) jiahanc 2025-08-03 00:50:34 -07:00
  • 2ff46b8826 [Misc] Bump ray to 2.48.0 (#22123) Rui Qiao 2025-08-02 19:42:00 -07:00
  • 554df8a6a2 Revert "[compile][startup] Disable C++ compilation of symbolic shapes" (#22122) Xiao 2025-08-02 09:03:30 -07:00
  • 73e1b9b1d4 [xpu]support moe models on XPU platform (#21643) Yan Ma 2025-08-02 22:49:08 +08:00
  • 4abfd8796f [V1] [Hybrid] Validate compatibility of attention backend batch reordering at init time (#21557) Thomas Parnell 2025-08-02 14:29:40 +02:00
  • f5d0f4784f [Frontend] Improve error message for too many mm items (#22114) Cyrus Leung 2025-08-02 17:20:38 +08:00
  • b690e34824 [Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead (#21075) Chih-Chieh Yang 2025-08-02 04:59:34 -04:00
  • 25373b6c6c for glm-4.1V update (#22000) Yuxuan Zhang 2025-08-02 16:46:57 +08:00
  • 58eee5f2e0 [PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion (#20000) Vadim Gimpelson 2025-08-02 12:43:52 +04:00
  • 067c34a155 docs: remove deprecated disable-log-requests flag (#22113) Roger Wang 2025-08-02 00:19:48 -07:00
  • c64861d63c [Bugfix] Mamba2 remove bugged initial state condition in chunk scan (#22034) Chih-Chieh Yang 2025-08-02 02:55:57 -04:00
  • 8564dc9448 Fix test_kv_sharing_fast_prefill flakiness (#22038) Yong Hoon Shin 2025-08-01 23:55:34 -07:00
  • 4ac8437352 [Misc] Getting and passing ray runtime_env to workers (#22040) Rui Qiao 2025-08-01 23:54:40 -07:00
  • d3a6f2120b [FEAT][ROCm] Enable running Flash Attention as ViT attn backend for Qwen-VL models on ROCm platform. (#22069) vllmellm 2025-08-02 14:53:18 +08:00
  • 0edaf752d7 [Attention][DBO] Add support for "splitting" the CommonAttentionMetadata (#21153) Sage Moore 2025-08-01 19:47:53 -07:00
  • 6e8d8c4afb [Test] Add Unit Test for Batched DeepGEMM (#21559) Wentao Ye 2025-08-01 22:45:46 -04:00
  • 8d524ce79f [BugFix] Improve internal DP load balancing (#21617) Nick Hill 2025-08-02 03:45:27 +01:00
  • 9f9c38c392 [Speculators][Speculative Decoding] Add Qwen Eagle3 Support (#21835) Dipika Sikka 2025-08-01 22:43:37 -04:00
  • a65f46be5e [Misc] DeepGemmExperts : Avoid JIT generation in the hot-path (#21955) Varun Sundar Rabindranath 2025-08-02 08:12:03 +05:30
  • 57393715e8 [Misc] VLLM_TARGET_DEVICE.lower() (#22101) Nicolò Lucchesi 2025-08-02 04:41:40 +02:00
  • ee2eb6ecd8 [Model] Qwen2.5 VL SiLU-and-Mul (#22066) vllmellm 2025-08-02 10:34:37 +08:00
  • 23322431c8 [V1][CUDA] Full cudagraph support for FlashInfer (#21367) fhl2000 2025-08-02 09:49:34 +08:00
  • 3654847db5 feat: Add Support GPTQ Quantization MOE on ROCM vllm serve (#21733) JartX 2025-08-02 03:12:19 +02:00
  • eefbf4a68b [Perf] Optimize reshape_and_cache_flash CUDA Kernel (#22036) Wentao Ye 2025-08-01 19:18:51 -04:00
  • 88faa466d7 [CI] Initial tests for SM100 Blackwell runner (#21877) Michael Goin 2025-08-01 19:18:38 -04:00
  • 881e1af43a [BugFix] Harden distributed DP startup (#21538) Nick Hill 2025-08-01 22:40:45 +01:00
  • d84b97a3e3 Add lora test for tp>1 case for TPU. (#21970) XiongfeiWei 2025-08-01 11:56:08 -07:00
  • d331759488 Introduce RayPPCommunicator for ray-based PP (#21660) Rui Qiao 2025-08-01 11:50:58 -07:00
  • 9659bc7f27 [compile][startup] Disable C++ compilation of symbolic shapes (#20836) Animesh Jain 2025-08-01 10:38:52 -07:00
  • 3277e8f9e1 Fix pre-commit failure for SECURTIY.md (#22102) Michael Goin 2025-08-01 13:36:07 -04:00
  • 8d705996df [Misc] Minor enhancement of benchmark_moe (#22068) Jee Jee Li 2025-08-02 01:35:30 +08:00
  • 38c8bce8b6 Enable headless models for pooling in the Transformers backend (#21767) Harry Mellor 2025-08-01 18:31:29 +01:00
  • ac45c44d98 [Bugfix] [Performance] DeepEPHighThroughput + DeepSeek : Quant before Dispatch (#21837) Varun Sundar Rabindranath 2025-08-01 22:44:38 +05:30
  • d6664664b4 security policy: take 1 (#21119) Huzaifa Sidhpurwala 2025-08-01 21:09:49 +04:00
  • b879ecd6e2 [Bugfix] fix when skip tokenizer init (#21922) rongfu.leng 2025-08-02 01:09:36 +08:00
  • 3f8e952179 [Bugfix] Fix glm4.1v video inference issue (#22067) Isotr0py 2025-08-02 00:33:30 +08:00
  • 326a1b001d Improve documentation of ModelConfig.try_get_generation_config to prevent future confusion (#21526) Harry Mellor 2025-08-01 17:32:27 +01:00
  • 2d7b09b998 Deprecate --disable-log-requests and replace with --enable-log-requests (#21739) Harry Mellor 2025-08-01 17:16:37 +01:00
  • 97608dc276 [Docs] use uv in CPU installation docs (#22089) David Xia 2025-08-01 10:55:55 -04:00
  • 3146519add [BugFix] Don't change title of top-level process (#22032) Nick Hill 2025-08-01 15:37:55 +01:00
  • 8026a335a1 [BugFix] Update AttnFusionPass cache key (#21947) Richard Zou 2025-08-01 10:11:29 -04:00
  • a59cd9d9f7 [Refactor] Fix Compile Warning #1444-D (#21462) Wentao Ye 2025-08-01 09:10:30 -04:00
  • 5c54d9759d [Bugfix][PD] set max_completion_tokens=1 if req has this value (#21841) Abirdcfly 2025-08-01 21:08:45 +08:00
  • 0a6d305e0f feat(multimodal): Add customizable background color for RGBA to RGB conversion (#22052) Gamhang 2025-08-01 21:07:33 +08:00
  • f81c1bb055 [Bugfix] Check NVIDIA artifactory is accessible before using flashinfer cubin kernels (#21893) Michael Goin 2025-08-01 08:28:45 -04:00
  • fb0e0d46fc Fix get_kwargs for case where type hint is list[Union[str, type]] (#22016) Harry Mellor 2025-08-01 13:26:42 +01:00
  • 26b5f7bd2a [BUG] [ROCm] Fix import bug on ROCm (#22083) TJian 2025-08-01 05:25:20 -07:00
  • dfbc1f8880 [Speculative Decoding] Add speculators config support (#21345) Dipika Sikka 2025-08-01 08:25:18 -04:00
  • 87c94bc879 Revert "Update sampling_metadata.py (#21937)" (#22088) Harry Mellor 2025-08-01 13:24:46 +01:00
  • 28b18cc741 [Quantization] Enable BNB support for InternS1 (#21953) Jee Jee Li 2025-08-01 19:09:54 +08:00
  • 4931486988 [Doc] Added warning of speculating with draft model (#22047) WeiQing Chen 2025-08-01 17:11:56 +08:00
  • 0f81b310db [Misc] Remove upper bound in openai package version (#22060) Woosuk Kwon 2025-08-01 02:11:40 -07:00
  • e6680f9e25 [Bugfix] Add log prefix in non-dp mode engine core (#21889) wuhang 2025-08-01 17:04:16 +08:00
  • 27a145e893 [Doc] Add example for Step3-VL (#22061) Roger Wang 2025-08-01 01:35:49 -07:00