Commit Graph

  • 5406ebf5c9 [CI] Pooling models mteb test uses enforce_eager (#22878) wang.yuqi 2025-08-15 16:16:15 +08:00
  • b2c06509e5 [P/D]Provide bucket algorithm rate limiter for proxy_server (#22643) frankie 2025-08-15 15:01:48 +08:00
  • b2f6c247a9 Revert "[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module." (#22956) TJian 2025-08-14 23:39:19 -07:00
  • 3d232dbd19 [Mamba] - refactor: Renamed mamba_attn to mamba2_attn (#22818) Asaf Joseph Gardin 2025-08-15 09:38:05 +03:00
  • 5c3fbfe46b [Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughput Improvement (#22763) Wentao Ye 2025-08-15 02:27:30 -04:00
  • b4cef5e6c7 refactor: Change scaling factors calculation for flashinfer FusedMoE (#22812) amirkl94 2025-08-15 09:19:31 +03:00
  • 0fe85087a9 [CI Perf] Prune tests in tests/kernels/attention/ (#22936) Michael Goin 2025-08-14 23:34:53 -04:00
  • d2b0e97ea6 [CI Perf] Prune tests in tests/kernels/moe/ (#22939) Michael Goin 2025-08-14 23:33:42 -04:00
  • 590bddbfc5 [CI Perf] Prune tests in tests/kernels/quantization/ (#22942) Michael Goin 2025-08-14 23:25:34 -04:00
  • ae05a6d83d [BugFix] Fix port lookup in internal DP LB tests (#22252) Nick Hill 2025-08-14 20:17:11 -07:00
  • 0933f9d518 [BugFix][KVConn] Fix use of get_required_kvcache_layout (#22734) Nick Hill 2025-08-14 18:39:43 -07:00
  • f1f0d2fab8 Revert "[Kernel] Add cuda kernel for gpt_oss activation" (#22948) Simon Mo 2025-08-14 17:38:10 -07:00
  • 81f4b96481 [Kernel] Add cuda kernel for gpt_oss activation (#22538) Jee Jee Li 2025-08-15 08:21:29 +08:00
  • 39cd09dc86 [Bugfix] use flash attn on sm90 (#22933) Yongye Zhu 2025-08-14 19:37:22 -04:00
  • 919234fe17 [BugFix] Fix initial DP request load imbalance (#22910) Nick Hill 2025-08-14 15:20:28 -07:00
  • ebcce2cd36 [Core] Return final response for aborted requests from AsyncLLM.generate (#22283) Nick Hill 2025-08-14 14:49:02 -07:00
  • 4121de512e [Quantization]: Support compressed-tensors mixed-precision model loading (#22468) Dipika Sikka 2025-08-14 17:32:09 -04:00
  • 279a5f31b3 [Kernel] Add nvfp4 gemm flashinfer backends (#22346) nvjullin 2025-08-15 04:03:55 +08:00
  • b8ff05361a [CI] Temporarily disable flaky test (#22930) Lucas Wilkinson 2025-08-14 15:59:16 -04:00
  • 637093ae26 docs: update fastsafetensors usage instructions (#22891) Nir 2025-08-14 22:56:54 +03:00
  • 33c63e9547 [Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel (#22428) Jinzhen Lin 2025-08-15 02:23:22 +08:00
  • ab9f2cfd19 [CI] [Hybrid] Bump min transformers version for Bamba and Jamba (#22908) Thomas Parnell 2025-08-14 20:01:16 +02:00
  • 52c905a3d4 Merge branch 'vllm-project:main' into wye-refactor-quant-folder Wentao Ye 2025-08-14 11:12:23 -04:00
  • dbe298046c [Bugfix] Fix parsing of --disable-mm-preprocessor-cache (#22909) Cyrus Leung 2025-08-14 23:09:44 +08:00
  • 625ccd1c4d [Bugfix] Replace custom Encoding class with BatchEncoding in MistralTokenizer (#22786) Jiangyun Zhu 2025-08-14 23:09:27 +08:00
  • 92ff41abea [Model] Modify the gate implementation of glm4_moe (#22832) Jee Jee Li 2025-08-14 20:28:50 +08:00
  • 829b9a62d0 [Perf] Dont create unnecessary pooling params (#22876) Lucas Wilkinson 2025-08-14 08:28:09 -04:00
  • 540d54ca8d [CI] Re-enable transcriptions test_long_audio_request (#22890) Nicolò Lucchesi 2025-08-14 13:34:34 +02:00
  • 0783f13960 [Doc] fix dead link (#22898) Daniele 2025-08-14 13:06:13 +02:00
  • 7655dc3e45 [Bugfix] Add reset prefix cache for online serving (#22726) iAmir97 2025-08-14 18:04:18 +07:00
  • f4efda821d Remove Phi 4 Flash configuration workaround (#22723) Harry Mellor 2025-08-14 12:03:49 +01:00
  • eb08487b18 [BugFix] Threadsafe close async zmq sockets (#22877) Nick Hill 2025-08-14 03:44:29 -07:00
  • 7c3a0741c6 [Bugfix] Fix PixtralHFImagePixelInputs dynamic shape check (#22827) Isotr0py 2025-08-14 17:35:43 +08:00
  • 00e3f9da46 vLLM Benchmark suite improvement (#22119) Louie Tsai 2025-08-14 00:12:17 -07:00
  • a353bd083d [CI] remove flaky v0 test (#22864) Robert Shaw 2025-08-14 00:41:51 -04:00
  • 1d20c34717 [CI] Fix tests/distributed/test_ca_buffer_sharing.py (#22849) Ilya Markov 2025-08-14 05:09:30 +02:00
  • b6af24fba7 [CI][Entrypoints]: add filter to generation to filter out invalid tool calls (#22826) Will Eaton 2025-08-13 23:09:07 -04:00
  • 0ca2393b47 [CI/Build] Increase pooling tolerance to pass CI (#22844) Cyrus Leung 2025-08-14 06:52:48 +08:00
  • 31a500c86f [Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP (#22437) Jialin Ouyang 2025-08-13 14:44:06 -07:00
  • 4e8614e88b Move checklist in PR template (#22852) Luka Govedič 2025-08-13 17:38:35 -04:00
  • c6cd5ca3d3 [ROCm][Bugfix] Fix compilation error in topk softmax fused kernel (#22819) kliuae 2025-08-14 04:45:03 +08:00
  • df0e0f023e [CI/Build] Skip gpt_big model test because of broken HF model (#22848) Isotr0py 2025-08-14 04:36:28 +08:00
  • b4b78d6317 [CI/Build] Fix param mismatch in test_eagle_correctness (#22847) Cyrus Leung 2025-08-14 01:55:25 +08:00
  • 12817a8ac7 [CI] Fix tests/v1/e2e/test_kv_sharing_fast_prefill.py import on test (#22815) Nicolò Lucchesi 2025-08-13 19:35:50 +02:00
  • c9232d41f4 [CI/Build] Update VLM common tests (#22841) Cyrus Leung 2025-08-14 01:03:05 +08:00
  • 9bd9294f0e [Bugfix] Fix MiniCPMV Image input inference failed (#22813) HWH 2025-08-14 00:41:41 +08:00
  • e1b37e06b7 Merge branch 'vllm-project:main' into wye-refactor-quant-folder Wentao Ye 2025-08-13 10:53:20 -04:00
  • da2705198f [Misc] clear and separate error messages for input too long and input + max-tokens too long (#22803) Roger Wang 2025-08-13 07:22:56 -07:00
  • 19b927e52d [Core] Use individual MM items in P0/P1 cache and model runner (#22570) Cyrus Leung 2025-08-13 22:18:07 +08:00
  • 20d65aa755 [Frontend] Multithreaded async multimodal load_bytes (#22710) milesial 2025-08-13 06:09:26 -07:00
  • b159c0a67a Fix GGUF loader for Qwen3 MoE. (#22785) Gh0u1L5 2025-08-13 21:08:23 +08:00
  • 6772bb0f7d Remove unnecessary CUDA sync of qwen image and video preprocess (#22792) Yuanyuan Chen 2025-08-13 21:07:28 +08:00
  • fceafaf582 [Bugfix][mamba] Fix type annotation of Mamba2Metadata (#22787) Chen Zhang 2025-08-13 06:07:09 -07:00
  • 6b794c756c [Nixl][CI] Fix tests (#22806) Nicolò Lucchesi 2025-08-13 15:03:53 +02:00
  • 98deac3879 [FEATURE] support custom vllm tuned config path for fused moe triton kernels (#22791) Chi Zhang 2025-08-13 20:27:25 +08:00
  • 653124bd46 [Frontend] Add chunked processing to handle long inputs in embedding models (#22280) Kdump 2025-08-13 19:14:24 +08:00
  • 0b1bdac6af [Platform] Custom ops support for FusedMoe (#22509) wangxiyuan 2025-08-13 19:12:00 +08:00
  • d94e3026de [V1] Add tree drafting tests for eagle spec decoding (#22705) Giancarlo Delfin 2025-08-13 04:11:28 -07:00
  • 3f52738dce [Doc] Add max_lora_rank configuration guide (#22782) 633WHU 2025-08-13 19:10:07 +08:00
  • a01e0018b5 [Bugfix] Fix Nemotron VL image processing (#22739) Duc-Viet Hoang 2025-08-13 17:11:36 +07:00
  • 9e7e5baaa8 [Model] Add missing prefix to glm4_1v (#22716) Yuxuan Zhang 2025-08-13 16:23:33 +08:00
  • d16aa3dae4 [Model] Add option to run Step3VisionEncoder in DP (#22697) zzh142857 2025-08-13 03:09:13 -04:00
  • 6807af8f46 [gpt-oss] upgrade gpt-oss to v0.0.3 and add version check (#22768) Chen Zhang 2025-08-12 21:37:26 -07:00
  • 4c558cf62e [Perf] Support topk softmax fused kernel for broader num_experts (#22211) shixianc 2025-08-12 21:34:47 -07:00
  • 77a6bf07ae [Bug] Fix Unexpected Keyword Argument 'w1_bias' (#22757) Wentao Ye 2025-08-13 00:31:47 -04:00
  • 4082338a25 Remove unneeded ROCm platform import when using CUDA (#22765) Michael Goin 2025-08-13 00:26:38 -04:00
  • c6b928798e Force TRTLLM attention for gpt-oss on SM100 (#22678) Michael Goin 2025-08-13 00:22:16 -04:00
  • b1361c7273 [Bugfix] Fix default enable for CUTLASS MLA on SM100 (#22738) Michael Goin 2025-08-13 00:22:05 -04:00
  • 4f0f844b16 Fix cuda illegal mem access with Llama4 TP8 + rms_norm custom op (#22701) Po-Han Huang (NVIDIA) 2025-08-13 12:21:50 +08:00
  • c5830381af [V0 Deprecation] Remove args for multi-step scheduling (#22779) Woosuk Kwon 2025-08-12 20:38:18 -07:00
  • d31f97cf57 [Misc] Remove tests/multi_step/__init__.py (#22778) Woosuk Kwon 2025-08-12 20:21:18 -07:00
  • 71683ca6f6 [V0 Deprecation] Remove multi-step scheduling (#22138) Woosuk Kwon 2025-08-12 20:18:39 -07:00
  • e18859298d Add hardware plugins to installation doc (#22732) Michael Goin 2025-08-12 20:14:46 -04:00
  • fde0b611a3 [Model] Decouple glm4v (#22751) Jee Jee Li 2025-08-13 08:13:17 +08:00
  • d0a6301588 Fix Transformers backend tensor parallel for multimodal models (#22673) Harry Mellor 2025-08-13 01:12:30 +01:00
  • 45c3936e94 [Docs] Hide the navigation and toc sidebars on home page (#22749) Harry Mellor 2025-08-13 01:12:26 +01:00
  • ba81acbdc1 [Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues (#22606) Frank Wang 2025-08-12 15:43:06 -07:00
  • 53c730286c [Misc] parametrize 'dtype' in test_flash_mla (#22641) RUTHLESS-BOT 2025-08-13 04:31:48 +08:00
  • 6534d2fc97 Fix torch version check for SM100 mxfp4 (#22535) zifeitong 2025-08-12 12:54:42 -07:00
  • 422f22e012 [CI][Nixl] Check kv cache layout during handshake (#22745) Nicolò Lucchesi 2025-08-12 21:53:52 +02:00
  • 6bd8ebf026 [Kernel][AMD] Avoid D2H copy and cumsum kernel (#22683) Xiaozhu Meng 2025-08-12 12:53:36 -07:00
  • 66d491c494 Merge branch 'vllm-project:main' into wye-refactor-quant-folder Wentao Ye 2025-08-12 15:18:34 -04:00
  • dab4f9f764 [Chore] Update CODEOWNERS to include @yewentao256 for CUDA kernels, attention backends, quantization, and related tests (#22741) Wentao Ye 2025-08-12 12:50:31 -04:00
  • c42fe0b63a Add more test scenario for tensor schema (#22733) TeeKen Lau 2025-08-13 02:34:41 +10:00
  • 5a4b4b3729 Add: SupportsEagle3 interface for explicit EAGLE3 support (#22642) Rahul Tuli 2025-08-12 21:54:52 +05:30
  • e5d3d63c42 [Benchmark] Fix terminal colors in benchmark_serving_multi_turn (python 3.12) (#22730) Daniel Serebrenik 2025-08-12 17:41:37 +03:00
  • 3d9d40efde [Bugfix][CI] Fix test_remote_decode_lifecycle.py::test_short_prompt_lifecycle (#22727) Nicolò Lucchesi 2025-08-12 16:30:17 +02:00
  • 67c153b88a Fix Llama4 FlashInfer FP4 MoE issues (#22511) Po-Han Huang (NVIDIA) 2025-08-12 20:50:59 +08:00
  • f7ad6a1eb3 [CI Failure] fix tests/entrypoints/openai/test_skip_tokenizer.py (#22708) wang.yuqi 2025-08-12 20:42:58 +08:00
  • 80bb1e8afe Officially support SmolLM3 using the Transformers backend (#22665) Harry Mellor 2025-08-12 13:38:48 +01:00
  • d030b01548 [BugFix][Nixl][PD] Fix heterogenous TP (#22663) Nicolò Lucchesi 2025-08-12 14:37:30 +02:00
  • 767e63b860 [Docs] Improve docs navigation (#22720) Harry Mellor 2025-08-12 12:25:55 +01:00
  • 007dd90859 [gpt-oss] Enable gpt-oss on ampere (#22714) Yongye Zhu 2025-08-12 06:21:44 -04:00
  • b8a9d0e429 [Misc] remove GH discussions link (#22722) Jee Jee Li 2025-08-12 18:15:33 +08:00
  • 50f2aae1b4 [LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing (#21161) zejunchen-zejun 2025-08-12 17:05:14 +08:00
  • 46ae7f6666 [Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 (#21783) RishiAstra 2025-08-12 05:04:37 -04:00
  • 1ece7f30ba Fix: AWQ Marlin get_quant_method does not recognize "modules_to_not_convert" (#21888) Jun-Howie 2025-08-12 17:03:53 +08:00
  • bc8372efc3 [Bugfix] Fix erroneous randomly generated cases in bad word testing (#22170) phantomlei 2025-08-12 17:03:22 +08:00
  • 8d17fa633e [V0] Correct CUDA Graph capture for encoder-decoder models (#22630) Sugar-zsg 2025-08-12 17:01:08 +08:00
  • 9f909b8996 [New Model] Support Command-A-Vision (#22660) dongluw 2025-08-12 04:39:54 -04:00