Commit Graph

  • 59f3b93636 [DOC] update v1_guide with INTEL HW (#22679) Chendi.Xue 2025-08-12 03:22:49 -05:00
  • 78077d5417 Move SchedulerConfig from config/__init__.py to config/scheduler.py (#22626) Harry Mellor 2025-08-12 08:23:49 +01:00
  • 6d729c43fb [Bugfix] Fix ModernBert load & Enable sliding window attention for bidirectional attention. (#22637) wang.yuqi 2025-08-12 15:23:17 +08:00
  • 2f4657952b [doc] Update x86 CPU-inference installation doc to reflect optionality of AVX512f (#22707) Sooraj S 2025-08-12 12:51:08 +05:30
  • 3a7e3bbdd2 [Doc] Added unmentioned required option "method" in the usage of EAGLE-3 based models (#21737) Hongsheng Liu 2025-08-12 15:14:51 +08:00
  • 4fbd8bb597 Fix passing SpeculativeConfig from the CLI (#22652) Harry Mellor 2025-08-12 06:13:32 +01:00
  • ad344ef552 [gpt-oss] Small bug fixes for frontend (#22512) Chen Zhang 2025-08-11 22:04:38 -07:00
  • bbaf9e9cb1 [gpt-oss] Fix mxfp4 support (#22700) Chen Zhang 2025-08-11 21:22:26 -07:00
  • 4678503476 Migrate MiniCPMVImageInputs to TensorSchema (#21939) Benji Beck 2025-08-11 20:43:37 -07:00
  • 93d0652433 [CI] Increase timeout for test_completion_with_image_embeds (#22670) Michael Goin 2025-08-11 23:31:36 -04:00
  • ea1292ad3e [CI Failure] Use float32 for tests/entrypoints/openai/test_audio.py (#22686) Michael Goin 2025-08-11 23:20:42 -04:00
  • dc5e4a653c Upgrade FlashInfer to v0.2.11 (#22613) Po-Han Huang (NVIDIA) 2025-08-12 10:58:41 +08:00
  • 839ab00349 Re-enable Xet on TPU tests now that hf_xet has been updated (#22666) Harry Mellor 2025-08-12 03:54:40 +01:00
  • 9b94d6ec8f Enable 4bit bnb prequant MOE (#21548) Andy Chen 2025-08-11 19:02:14 -07:00
  • 1891a265d3 [gpt-oss] Add test for response API + harmony (but skipped) (#22554) Chen Zhang 2025-08-11 17:47:24 -07:00
  • 95a935fc48 [gpt-oss] Support streaming in response API (#22431) Chen Zhang 2025-08-11 17:46:59 -07:00
  • 458e74eb90 Support more parallel styles in Transformers backend TP (#22651) Harry Mellor 2025-08-11 18:42:48 +01:00
  • 65abe111a3 [CI] Skip Tree Attn Test in test_max_len.py to unblock CI (#22664) TJian 2025-08-11 10:36:05 -07:00
  • 807d21b80d [BugFix] [Spec Decode] Remove LlamaForCausalLMEagle3 to fix CI (#22611) 22quinn 2025-08-11 10:31:36 -07:00
  • c90fb03df5 [CI/Build] Skip Mllama HF runner tests with Transformers v4.55.0 (#22659) Isotr0py 2025-08-12 01:00:58 +08:00
  • 84cf78acee [Model] Pooling models default to using chunked prefill & prefix caching if supported. (#20930) wang.yuqi 2025-08-12 00:41:37 +08:00
  • 16fb668b61 fix: NIXL connector transfers partial block to pass full multi-modal context (#21074) GuanLuo 2025-08-11 09:40:55 -07:00
  • f7dcce7a4a [Feature] Add VLLM_USE_DEEP_GEMM_E8M0 Env to Control E8M0 Scale (#21968) Wentao Ye 2025-08-11 12:39:08 -04:00
  • 8e13d9fe6d [Misc] Further clean up some redundant config definitions (#22649) Isotr0py 2025-08-12 00:22:25 +08:00
  • 3fa5b25845 Document aarch64 CPU support works (#22646) Eric Curtin 2025-08-11 15:22:45 +01:00
  • 14a5d903ab [Model] NemotronH Support (#22349) danielafrimi 2025-08-11 14:09:24 +03:00
  • 951b038298 [Misc] Move jsontree to utils (#22622) Cyrus Leung 2025-08-11 18:49:32 +08:00
  • ebf7605b0d [Misc] Move tensor schema tests (#22612) Cyrus Leung 2025-08-11 15:15:27 +08:00
  • bc1d02ac85 [Docs] Add comprehensive CLI reference for all large vllm subcommands (#22601) Harry Mellor 2025-08-11 08:13:33 +01:00
  • 1e55dfa7e5 [BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ on ROCm (#22017) JartX 2025-08-11 09:13:30 +02:00
  • 384a052971 [Misc] benchmark_moe supports expert parallel (#22251) Jee Jee Li 2025-08-11 15:13:27 +08:00
  • 39052dbca8 Support token_type_ids in V1 with less code changes (#21985) Maximilien de Bayser 2025-08-11 02:54:59 -03:00
  • 9c97a1c349 [ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module. (#22521) vllmellm 2025-08-11 13:52:34 +08:00
  • f919d4cb8f [BugFix] Fix logits repetition penalty cuda check (#22592) Eugene Cheah 2025-08-10 22:52:31 -07:00
  • afa5b7ca0b [Misc][gpt-oss] guard import when triton kernel when not up to date (#22584) Zhewen Li 2025-08-10 21:29:35 -07:00
  • 1b99028069 [Misc][gpt-oss] Add rules to label gpt-oss related PRs (#22600) Lifans 2025-08-10 19:49:51 -07:00
  • 5898b135ab [BugFix] Fix KVConnectorOutput TPU breakage (#22598) Nick Hill 2025-08-10 19:33:48 -07:00
  • b799f4b9ea [CI/Build] Fix tensorizer test for load_format change (#22583) 22quinn 2025-08-10 19:30:00 -07:00
  • 06da44f0cb Migrate LlavaImageInputs to TensorSchema (#21770) Benji Beck 2025-08-10 19:29:19 -07:00
  • a554991748 Migrate LlavaNextVideoPixelInputs to TensorSchema (#21843) Benji Beck 2025-08-10 19:29:16 -07:00
  • d1af8b7be9 enable Docker-aware precompiled wheel setup (#22106) Doug Smith 2025-08-10 19:29:02 -04:00
  • 68b254d673 Fix TensorSchema validation test for symbolic dims (#22366) Benji Beck 2025-08-10 10:16:44 -07:00
  • 8c50d62f5a Remove redundant row_indices unsqueeze operation in MiniCPMO (#22528) ZiTian Zhao 2025-08-11 00:20:00 +08:00
  • b4e2916721 Migrate LlavaNextImageInputs to TensorSchema (#21774) Benji Beck 2025-08-10 09:05:21 -07:00
  • 65a7917be4 Fix(benchmarks): allow multiple mm contents in OpenAI Chat Completion Benchmarks (#22534) Breno Baldas Skuk 2025-08-10 18:03:15 +02:00
  • b76753f0b5 [Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel (#22593) Isotr0py 2025-08-11 00:00:36 +08:00
  • b81fe83b2c [doc] add alibaba cloud as sponsor (#22597) youkaichao 2025-08-10 23:13:47 +08:00
  • 0757551c96 [doc] add beijing meetup links (#22596) youkaichao 2025-08-10 22:51:36 +08:00
  • 8290d15d2c Move CacheConfig from config/__init__.py to config/cache.py (#22586) Harry Mellor 2025-08-10 15:36:40 +01:00
  • 049c245143 [Misc] Replace flaky image urls in pixtral test (#22574) Isotr0py 2025-08-10 21:18:21 +08:00
  • 00976db0c3 [Docs] Fix warnings in docs build (#22588) Harry Mellor 2025-08-10 13:49:51 +01:00
  • d411df0296 [Misc] Further refine type annotations in parallel state (#22499) Cyrus Leung 2025-08-10 20:49:48 +08:00
  • 010e0e39ea [Doc] Fix API doc link in side navigation (#22585) 22quinn 2025-08-10 01:35:22 -07:00
  • 326976291b [Misc] code clean duplicate set_current_vllm_config in _set_vllm_config (#22566) Ning Xie 2025-08-10 15:08:48 +08:00
  • 7e8d685775 [Minor] Fix pre-commit error on main (#22579) Isotr0py 2025-08-10 15:08:23 +08:00
  • c49848396d Refactor sliding window configuration to Transformers best practice (#21927) Harry Mellor 2025-08-10 04:50:48 +01:00
  • 2a84fb422f [TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block (#22394) Chengji Yao 2025-08-09 20:49:04 -07:00
  • 534c45b962 Improve fast_topk function with type hints and documentation (#22530) ZiTian Zhao 2025-08-10 11:25:42 +08:00
  • 3d7363e61c [Config] add "qwen" as a native eagle3 target supported model (#22333) Le Chen 2025-08-10 11:21:05 +08:00
  • 0c5254b82a [oss] Init gpt-oss bf16 support (#22508) Jee Jee Li 2025-08-10 11:19:13 +08:00
  • 61f67d8acd [V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers (#21401) Thomas Parnell 2025-08-10 05:16:11 +02:00
  • 42172ad18f [FEAT] [Performance] Add triton mrope to replace the torch code path (#22375) TJian 2025-08-09 11:50:03 -07:00
  • fbd8595c5c [Bugfix] Fix basic models tests hanging due to mm processor creation (#22571) Isotr0py 2025-08-10 02:42:21 +08:00
  • 5a16fa614c [Model] Gemma3n MM (#20495) Nicolò Lucchesi 2025-08-09 18:56:25 +02:00
  • 2d18256e47 Move ParallelConfig from config/__init__.py to config/parallel.py (#22565) Harry Mellor 2025-08-09 16:33:46 +01:00
  • 56186474f6 [Docs] Reduce noise in docs and --help from the JSON tip (#22567) Harry Mellor 2025-08-09 16:31:32 +01:00
  • 1bf5e1f25b [CI] [Hybrid] Speed up hybrid models test by removing large models (#22563) Thomas Parnell 2025-08-09 11:04:42 +02:00
  • a6022e6fbc GLM-4.5V with new class name at transformers (#22520) Yuxuan Zhang 2025-08-09 15:50:21 +08:00
  • 2be07a0db1 Update docs for Minimax-Text support (#22562) Thomas Parnell 2025-08-09 09:18:18 +02:00
  • 0edc0cd52b [Bugfix] Fix CI moe kernel failure (#22556) Jee Jee Li 2025-08-09 15:03:29 +08:00
  • 7920e9b1c5 [Bugfix] Fix failing GPT-OSS initialization test (#22557) Isotr0py 2025-08-09 15:03:26 +08:00
  • b7c0942b65 [ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged attention kernel (#22097) Charlie Fu 2025-08-09 01:15:06 -05:00
  • 9a0c5ded5a [TPU] Add support for online w8a8 quantization (#22425) Kyuyeun Kim 2025-08-08 23:12:54 -07:00
  • 10a02535d4 Fix loading of quantized BigCode models (#22463) Eldar Kurtić 2025-08-09 08:12:12 +02:00
  • 65552b476b [Misc] Use config definitions from Transformers library (#21913) Cyrus Leung 2025-08-09 14:10:51 +08:00
  • 7ad7adb67f v1: Pass KVConnectorOutput to scheduler-side (#22157) Or Ozeri 2025-08-09 09:09:51 +03:00
  • 6ade99eafa [V1] [Hybrid] Support Minimax-Text-01 in V1 (#22151) Thomas Parnell 2025-08-09 08:08:48 +02:00
  • 3157aebb63 [Log] Add Warning for Deprecation of DeepGEMM old version (#22194) Wentao Ye 2025-08-09 02:07:48 -04:00
  • 8a0ffd6285 Remove mamba_ssm from vLLM requirements; install inside test container using --no-build-isolation (#22541) Thomas Parnell 2025-08-09 08:05:32 +02:00
  • 23472ff51c [Doc] Add usage of implicit text-only mode (#22561) Roger Wang 2025-08-08 23:04:19 -07:00
  • 08b751ba74 Implicit language-model-only mode via limit-mm-per-prompt (#22299) Roger Wang 2025-08-08 22:21:40 -07:00
  • 429e4e2d42 [Bugfix] Fix ModernBert cuda graph capturing in v1 (#21901) Isotr0py 2025-08-09 13:17:22 +08:00
  • 35afe1b30b [BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D (#22317) Pradyun92 2025-08-08 20:04:15 -04:00
  • 81c57f60a2 [XPU] upgrade torch 2.8 on for XPU (#22300) Kunshang Ji 2025-08-09 08:03:45 +08:00
  • 311d875614 Drop flaky test_healthcheck_response_time (#22539) Russell Bryant 2025-08-08 19:56:47 -04:00
  • e3edc0a7a8 Extract CompilationConfig from config.py (#22524) Harry Mellor 2025-08-09 00:34:25 +01:00
  • baece8c3d2 [Frontend] Add unix domain socket support (#18097) yyweiss 2025-08-09 02:23:44 +03:00
  • 2fcf6b27b6 [Docs] fix broken links in metrics.md (#22315) Guy Stone 2025-08-08 19:22:35 -04:00
  • 41b9655751 Skip Qwen 1 in CI because remote code is no longer compatible with Transformers (#22536) Harry Mellor 2025-08-09 00:20:58 +01:00
  • bd875d2eb7 [Bugfix] Update FA commit hash (#22546) Thomas Parnell 2025-08-09 01:10:25 +02:00
  • f703b923f3 [Misc] DeepGEMM : Avoid JIT generation in the hot-path (#22215) Varun Sundar Rabindranath 2025-08-08 19:09:59 -04:00
  • cd9b9de1fb [BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA (#21691) Lucas Wilkinson 2025-08-08 19:09:42 -04:00
  • fe6d8257a1 [gpt-oss] Support tool call and implement MCP tool server (#22427) Chen Zhang 2025-08-08 15:06:37 -07:00
  • e290594072 [Docs] Rename “Distributed inference and serving” to “Parallelism & Scaling” (#22466) Ricardo Decal 2025-08-08 12:26:21 -07:00
  • f756a682d9 [gpt-oss] guard import when triton kernel is not installed (#22529) Yongye Zhu 2025-08-08 11:18:33 -07:00
  • f0964e29cb [Benchmark] Add benchmark tool for multi turn conversations (#20267) Daniel Serebrenik 2025-08-08 20:28:50 +03:00
  • e789cad6b8 [gpt-oss] triton kernel mxfp4 (#22421) Yongye Zhu 2025-08-08 08:24:07 -07:00
  • e5ebeeba53 Remove exception for Python 3.8 typing from linter (#22506) Harry Mellor 2025-08-08 11:06:46 +01:00
  • 7be7f3824a [Docs] Improve API docs (+small tweaks) (#22459) Harry Mellor 2025-08-08 11:02:51 +01:00
  • ccdae737a0 [BugFix] Don't cancel asyncio tasks directly from destructors (#22476) Nick Hill 2025-08-08 01:13:18 -07:00