Commit Graph

  • a69693e38f Migrate Qwen inputs to TensorSchema (#23473) Benji Beck 2025-08-27 19:43:26 -07:00
  • 5da4f5d857 [Bugfix] Fix for V1 priority scheduling crashes at preemption (#23713) Hanchenli 2025-08-27 17:44:52 -07:00
  • 321938e9ac [Feature] Add VLLM_DISABLE_PAD_FOR_CUDAGRAPH to Avoid Hang Issue (#23595) Wentao Ye 2025-08-27 17:52:24 -04:00
  • f9ca2b40a0 [Bugfix] Fix Marlin NVFP4 for modelopt (#23659) Michael Goin 2025-08-27 17:48:16 -04:00
  • afe23a2990 use absolute path yewentao256 2025-08-27 21:44:27 +00:00
  • e92676ef4e update for fp8 yewentao256 2025-08-27 21:36:03 +00:00
  • 082cc07ef8 DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 (#23608) Yongye Zhu 2025-08-27 17:33:21 -04:00
  • 57f2f26a05 update directory for cutlass w8a8 yewentao256 2025-08-27 21:05:41 +00:00
  • 853c371fc3 [V1][Mamba] - Enable V1 by default for Mamba Models (#23650) Asaf Joseph Gardin 2025-08-27 23:53:30 +03:00
  • c643e63f98 Merge branch 'main' into wye-refactor-quant-folder yewentao256 2025-08-27 20:29:14 +00:00
  • 8bf6266a17 [Multimodal] Generate mm_hash based on request metadata when caching is turned off (#23690) Roger Wang 2025-08-27 13:24:31 -07:00
  • 0585a9e73c Disable torch.compile for dynamic rope models in Transformers backend (#23738) Harry Mellor 2025-08-27 20:03:05 +01:00
  • 3c0ef769ba ci: Add arm64 docker build to release pipeline (#23210) Eli Uriegas 2025-08-27 10:41:48 -07:00
  • 4e4d017b6f [Docs] Fix warnings in mkdocs build (continued) (#23743) Hyogeun Oh (오효근) 2025-08-28 02:17:29 +09:00
  • dd58932280 [V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models (#22589) Thomas Parnell 2025-08-27 19:05:16 +02:00
  • 52883ed084 [Model] Merge SupportsMultiModalWithRawInput with SupportsMultiModal (#23749) Cyrus Leung 2025-08-28 01:01:50 +08:00
  • 4f35be10a9 [BugFix] Fix topk_softmax assert (#19764) Luka Govedič 2025-08-27 12:47:28 -04:00
  • 2b61d2e22f [Docs] Remove in-tree Gaudi install instructions (#23628) Harry Mellor 2025-08-27 17:22:21 +01:00
  • 3ce8285d6d [LogitsProcs] Deduplicate built-in LP implementation logic (#23362) Nick Hill 2025-08-27 08:11:33 -07:00
  • 83f555f637 [Doc]: upgrade version of crate-ci tool for improved typo detection (#23755) Didier Durand 2025-08-27 16:59:34 +02:00
  • 841490434a [Model] Enable native HF format InternVL support (#23742) Isotr0py 2025-08-27 22:45:17 +08:00
  • 3af47c3cc6 [Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt (#23666) Wentao Ye 2025-08-27 10:09:08 -04:00
  • 513c1fe255 Only run get_attr_docs if generating help text (#23723) Harry Mellor 2025-08-27 14:55:12 +01:00
  • fe8d7b6f03 [Model] Interface to enable batch-level DP support (#23733) Cyrus Leung 2025-08-27 21:41:22 +08:00
  • 16dc4052b0 Fix pre-commit on main (#23747) Harry Mellor 2025-08-27 14:39:48 +01:00
  • 8dd2baa597 Add vLLM Korea Meetup in the README.md and meetups.md (#23746) rebel-hongseok 2025-08-27 22:25:49 +09:00
  • 5eeef1b908 [Model] Explicit default_pooling_type interface (#23736) Cyrus Leung 2025-08-27 21:24:09 +08:00
  • 704432af3c [V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models (#23716) Thomas Parnell 2025-08-27 14:51:54 +02:00
  • a403d0fa41 [Misc] Remove unnecessary _send_reconfig_message() in core_client.py (#23127) Nick Hill 2025-08-27 05:50:47 -07:00
  • 8c13820f0b [Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled (#23718) cndoit18 2025-08-27 20:42:20 +08:00
  • 9d30de4469 [model] Support MiniCPM-V 4.5 (#23586) tc-mb 2025-08-27 20:38:00 +08:00
  • 1f7a9c95e4 [Docs] Fix a 1-2-3 list and style issues in tpu.md (#23729) Michael Yao 2025-08-27 20:37:52 +08:00
  • 8f0d7eaea8 [XPU] Fix OOM issue for data parallel with Ray backend (#22500) Fanli Lin 2025-08-27 19:57:38 +08:00
  • e03940762b [CI/Build] Reduce LoRA layer test cases (#23721) Jee Jee Li 2025-08-27 18:59:35 +08:00
  • 11eddf02f0 [FlashInfer] Cache hyper params in metadata builder (#23732) Woosuk Kwon 2025-08-27 03:45:04 -07:00
  • 04ff1e43fb [Misc] Move CpuGpuBuffer to vllm/v1/utils.py (#23728) Woosuk Kwon 2025-08-27 03:25:00 -07:00
  • 6578e87365 Optimize input preparation for FlashInfer [2/N] (#23174) Woosuk Kwon 2025-08-27 02:52:45 -07:00
  • 5bd9f84158 [Docs] Fix an admonition important (#23726) Michael Yao 2025-08-27 17:50:09 +08:00
  • 91e382c935 [CI/Build] Remove redundant register in model init tests (#23715) Cyrus Leung 2025-08-27 16:11:15 +08:00
  • 6446677839 [XPU]fix cuda event used in XPU model runner (#23708) Kunshang Ji 2025-08-27 15:27:14 +08:00
  • 69244e67e6 [Core] Use key-only cache for BaseMultiModalProcessor (#23018) Cyrus Leung 2025-08-27 14:19:13 +08:00
  • 8dbf6ed7be [Bugfix] fix when config.yaml config value is list parse error (#23528) rongfu.leng 2025-08-27 13:54:39 +08:00
  • 9de25c294b [CI/Build] Remove redundant LoRA model tests (#23706) Jee Jee Li 2025-08-27 13:51:50 +08:00
  • fce10dbed5 [XPU] Add xpu torch.compile support (#22609) Kunshang Ji 2025-08-27 13:33:27 +08:00
  • d272415e57 [Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs (#22674) Dipika Sikka 2025-08-27 01:00:21 -04:00
  • 142ac08030 [Frontend] Optimize beam search performance by limiting concurrency (#23599) Chen Zhang 2025-08-26 21:59:14 -07:00
  • 3210264421 [Frontend] Add --log-error-stack to print stack trace for error response (#22960) Chen Zhang 2025-08-26 21:58:59 -07:00
  • 644d57d531 [Model] Add Ernie4.5 VL Model Support (#22514) CSWYF3634076 2025-08-27 12:02:55 +08:00
  • c905684cfe [Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. (#23686) Chenheli Hua 2025-08-26 20:05:34 -07:00
  • 786835807b [Bugfix]: Qwen3 Coder Tool Parser (#23099) Yiheng Xu 2025-08-27 10:58:32 +08:00
  • fecbb7c782 [Bugfix][gpt-oss] passing the cache config in gpt-oss (#23613) Wei 2025-08-26 19:54:23 -07:00
  • 6dab89b8ec [Docs] Fix math rendering in docs (#23676) Harry Mellor 2025-08-27 02:47:08 +01:00
  • de02b07db4 [Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 (#23678) Michael Goin 2025-08-26 21:34:57 -04:00
  • eb1995167e [gpt-oss] Enable unit test for response API harmony integration (#23533) Chen Zhang 2025-08-26 18:23:26 -07:00
  • 2c2b140ae8 [quantization] use channel scales for w4a8 + misc fixes (#23570) czhu-cohere 2025-08-26 21:23:23 -04:00
  • c7c80af084 fix pynccl reduce_scatter (#23648) yzds 2025-08-27 09:21:11 +08:00
  • 6891205b16 [Feature][Responses API] Support MCP tool in background mode (#23494) wuhang 2025-08-27 09:06:58 +08:00
  • b1625dbe9c feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 (#23695) zixuanzhang226 2025-08-26 18:06:10 -07:00
  • 585e0bde36 [Bugfix] UnboundLocalError when GptOss reasoning specified (#23054) Federico 2025-08-27 02:29:52 +02:00
  • 714872f1a9 [Compile] Fix Cmake Warning (#23689) Wentao Ye 2025-08-26 19:48:32 -04:00
  • 5f1af97f86 [V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 (#22594) Thomas Parnell 2025-08-27 01:28:55 +02:00
  • c3b0fd1ee6 [V1][P/D]P2pNcclConnector supports flashinfer (#23536) Zhonghua Deng 2025-08-27 06:56:16 +08:00
  • 6421b66bf4 [Docs] Move quant supported hardware table to README (#23663) Harry Mellor 2025-08-26 23:26:46 +01:00
  • 2f13319f47 Enhance the pre-notification policy (#23532) Huzaifa Sidhpurwala 2025-08-27 00:41:36 +04:00
  • d696f86e7b [doc] Hybrid KV Cache Manager design doc (#22688) Chen Zhang 2025-08-26 13:19:05 -07:00
  • 9816b81f5f [Model] Enable video support for InternVL3.5 models (#23658) Isotr0py 2025-08-27 03:46:52 +08:00
  • c37c0af990 [Misc] Fix comments in tests/kernels/quantization (#23675) Jiangyun Zhu 2025-08-27 03:31:20 +08:00
  • 9715f7bb0f [Bugfix] Fix incorrect original shape in hashing (#23672) Cyrus Leung 2025-08-27 03:01:25 +08:00
  • 98aa16ff41 [v1] Add cross-attention KV cache support for encoder-decoder models (#23664) Russell Bryant 2025-08-26 14:49:06 -04:00
  • 227e231b55 [Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models (#23665) Thomas Parnell 2025-08-26 20:33:16 +02:00
  • 730d0ac8b9 [Docs] Fix warnings in mkdocs build (#23649) Hyogeun Oh (오효근) 2025-08-27 03:19:23 +09:00
  • 9b0187003e [Bugfix] Fix cuda event usage with CPU model runner (#23643) Li, Jiang 2025-08-27 01:10:42 +08:00
  • 44ac25eae2 [CI] [Doc]: Add GH Action for auto labeling issues with rocm tag (#20988) vllmellm 2025-08-27 00:20:13 +08:00
  • 7ea22e42d5 [Misc] Add override for allreduce fusion thresholds (#23639) nvjullin 2025-08-26 23:53:04 +08:00
  • 9d4183dd2e [model] support qwen2audio embedding input (#23625) Yuekai Zhang 2025-08-26 23:48:08 +08:00
  • 513298f1b4 [Bugfix] fix bf16 multimodal model hash (#23623) Yuekai Zhang 2025-08-26 23:47:50 +08:00
  • 379f828fba [Docs] Reduce requirements for docs build (#23651) Harry Mellor 2025-08-26 16:43:28 +01:00
  • 1fdc732419 [ROCm] Starting to add AMD code reviewers for ROCm components (#23496) Hongxia Yang 2025-08-26 10:32:37 -04:00
  • f58675bfb3 [CPU] add cpu fused moe pytorch native implementation (#23146) TianyuLi0 2025-08-26 22:09:17 +08:00
  • 7c04779afa [Doc]: fix various spelling issues in multiple files (#23636) Didier Durand 2025-08-26 16:05:29 +02:00
  • f66673a39d [Kernel] Added flashinfer fp8 per-tensor gemms (#22895) nvjullin 2025-08-26 21:54:04 +08:00
  • b78bed1bc5 [Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) (#23565) En Ouyang 2025-08-26 21:04:25 +08:00
  • 164b2273c8 [Docs] Fix broken links to docs/api/summary.md (#23637) Harry Mellor 2025-08-26 14:00:18 +01:00
  • 2b4fc9bd9b Support FlashAttention Backend for Hybrid SSM Models (#23299) Chen Zhang 2025-08-26 05:41:52 -07:00
  • ebd5a77bb5 feat: add usage to TranscriptionResponse (text and json response_format) (#23576) Guillaume Calmettes 2025-08-26 14:26:26 +02:00
  • 384dd1b0a8 [Bugfix] Add missing enable_log_outputs parameter to init_app_state function (#23634) Matúš Námešný 2025-08-26 14:13:15 +02:00
  • fdeb3dac13 [Model] fix DeepSeek e_score_correction_bias dtype to fp32 (#23640) Jee Jee Li 2025-08-26 20:09:47 +08:00
  • d52358c1e0 [Perf] Remove duplicated NVFP4 blockscales to save memory (#23379) Michael Goin 2025-08-26 07:16:33 -04:00
  • 6ace2f72b0 Fix writing benchmark results with tuple keys (#23633) Huy Do 2025-08-26 04:16:09 -07:00
  • b00e69f8ca Fix nits from #20059 (#23548) Harry Mellor 2025-08-26 11:27:20 +01:00
  • 50fede6634 [V1] Enable V1 for compute capability < 8.0 + FP32 (#23614) Cyrus Leung 2025-08-26 18:00:18 +08:00
  • b5d34af328 [Bugfix] Fix scheduling when repeated images in one request (#23544) Roger Wang 2025-08-26 02:46:28 -07:00
  • 9b5f64238f [Bugfix] Fix Qwen25VL packed_modules_mapping (#23604) Jee Jee Li 2025-08-26 16:09:14 +08:00
  • ff77764f86 Fix CLI parameter documentation inconsistency in pooling_models.md (#23630) Raghavan 2025-08-26 13:35:37 +05:30
  • bfc1edc9f5 [Docs] Fix titles for multi-file examples that are rendered in the docs (#23573) Harry Mellor 2025-08-26 08:16:44 +01:00
  • 3ecbb14b81 [Benchmarks] add benchmark for embedding models (#23000) Jiangyun Zhu 2025-08-26 14:57:08 +08:00
  • 7d67a9d9f9 [mypy] Fix incorrect type hint for EAGLE3 support (#23617) Cyrus Leung 2025-08-26 14:50:17 +08:00
  • 959783fb99 [fix] fix seed-oss-parser (#23560) Bin Jia 2025-08-26 14:16:36 +08:00
  • ce0e9dbd43 [CI/Build] Fix typo in #23561 (#23616) Cyrus Leung 2025-08-26 14:13:03 +08:00
  • b395b3b0a3 [Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT (#22760) Zijing Liu 2025-08-25 21:06:00 -07:00