Commit Graph

  • 73e2e0118f [Quantization] Improve AWQ logic (#19431) Jee Jee Li 2025-06-12 19:02:11 +08:00
  • c9280e6346 [Bugfix] Respect num-gpu-blocks-override in v1 (#19503) jmswen 2025-06-12 04:00:23 -07:00
  • af09b3f0a0 [Bugfix][V1] Allow manual FlashAttention for Blackwell (#19492) Michael Goin 2025-06-12 06:40:24 -04:00
  • 4f6c42fa0a [Security] Prevent new imports of (cloud)pickle (#18018) Russell Bryant 2025-06-12 06:30:17 -04:00
  • dff680001d Fix typo (#19525) niu_he 2025-06-12 17:24:45 +08:00
  • 2e090bd5df [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (#19509) rasmith 2025-06-12 02:14:24 -05:00
  • 1b0b065eb5 [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (#19522) wonjun Jang 2025-06-12 16:00:47 +09:00
  • d5bdf899e4 [BugFix] Work-around incremental detokenization edge case error (#19449) Nick Hill 2025-06-11 23:43:20 -07:00
  • 7e3e74c97c [Frontend] Improve error message in tool_choice validation (#19239) 22quinn 2025-06-11 23:13:00 -06:00
  • 3f6341bf7f Add Triton Fused MoE kernel config for E=16 on B200 (#19518) Brayden Zhong 2025-06-12 00:31:51 -04:00
  • e5d35d62f5 [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (#19514) Varun Sundar Rabindranath 2025-06-12 00:28:12 -04:00
  • 2f1c19b245 [CI] change spell checker from codespell to typos (#18711) Ning Xie 2025-06-12 10:57:10 +08:00
  • 42f52cc95b [CI/Build] Fix torch nightly CI dependencies (#19505) Richard Zou 2025-06-11 17:40:42 -04:00
  • 97a9465bbc [UX] Add Feedback During CUDAGraph Capture (#19501) Robert Shaw 2025-06-11 14:09:05 -07:00
  • c7ea0b56cd [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (#17331) rasmith 2025-06-11 14:53:28 -05:00
  • 29fa5cac1c [Kernels] Add activation chunking logic to FusedMoEModularKernel (#19168) bnellnm 2025-06-11 12:53:10 -04:00
  • b2d9be6f7d [Docs] Remove WIP features in V1 guide (#19498) Woosuk Kwon 2025-06-11 09:15:03 -07:00
  • 04a55612dd [Misc] Fix misleading ROCm warning (#19486) Jee Jee Li 2025-06-12 00:12:10 +08:00
  • 89b0f84e17 [doc] fix "Other AI accelerators" getting started page (#19457) David Xia 2025-06-11 12:11:17 -04:00
  • 497a91e9f7 [CI] Update FlashInfer to 0.2.6.post1 (#19297) Michael Goin 2025-06-11 10:57:28 -04:00
  • 943ffa5703 [Bugfix] Update the example code, make it work with the latest lmcache (#19453) runzhen 2025-06-11 05:42:20 -07:00
  • 5c8d34a42c Support no privileged mode on CPU for docker and kubernetes deployments (#19241) Louie Tsai 2025-06-11 04:11:47 -07:00
  • 3c8694eabe Fix some typo (#19475) Ximingwang-09 2025-06-11 18:36:04 +08:00
  • 7484e1fce2 Add cache to cuda get_device_capability (#19436) Michael Goin 2025-06-11 05:37:05 -04:00
  • a2142f0196 Support non-string values in JSON keys from CLI (#19471) Cyrus Leung 2025-06-11 17:34:04 +08:00
  • 871d6b7c74 [Misc] Reduce warning message introduced in env_override (#19476) Lu Fang 2025-06-11 17:29:54 +08:00
  • 29a38f0352 [Doc] Support "important" and "announcement" admonitions (#19479) Cyrus Leung 2025-06-11 16:39:58 +08:00
  • a5115f4ff5 [Doc] Fix quantization link titles (#19478) Cyrus Leung 2025-06-11 16:27:22 +08:00
  • 68b4a26149 [Doc] Update V1 User Guide for Hardware and Models (#19474) Cyrus Leung 2025-06-11 15:49:06 +08:00
  • b8e809a057 [Kernel] Support deep_gemm for linear methods (#19085) artetaout 2025-06-11 15:14:45 +08:00
  • 5039ec2336 [ROCm] Add rules to automatically label ROCm related PRs (#19405) Lu Fang 2025-06-11 15:09:18 +08:00
  • 7c644ab6d5 Fix Typo in Documentation and Function Name (#19442) leopardracer 2025-06-11 08:44:11 +03:00
  • 2d40665fe8 Add fused MOE config for Qwen3 30B A3B on B200 (#19455) Junhao Li 2025-06-11 01:43:46 -04:00
  • 96ada386b7 [Misc] Remove unused MultiModalHasher.hash_prompt_mm_data (#19422) Lukas Geiger 2025-06-11 07:18:57 +02:00
  • 1e473b3010 [CI] Disable failing GGUF model test (#19454) Michael Goin 2025-06-11 01:12:38 -04:00
  • 2b1e2111b0 Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (#19451) Lu Fang 2025-06-11 12:54:59 +08:00
  • a45b979d9f [BugFix] Fix docker build cpu-dev image error (#19394) niu_he 2025-06-11 11:56:40 +08:00
  • 3952731e8f [New Model]: Support Qwen3 Embedding & Reranker (#19260) wang.yuqi 2025-06-11 11:07:30 +08:00
  • 77f0d465d0 [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (#19390) Richard Zou 2025-06-10 19:54:41 -04:00
  • 22c3c0aa4a Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (#19401) Xu Wenqing 2025-06-11 07:23:57 +08:00
  • 33f8dba7c6 [Model] use AutoWeightsLoader for commandr (#19399) py-andy-c 2025-06-10 15:42:21 -07:00
  • 5241ca50d6 [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (#19440) Gregory Shtrasberg 2025-06-10 18:06:15 -04:00
  • da9b523ce1 [Docs] Note that alternative structured output backends are supported (#19426) Russell Bryant 2025-06-10 12:20:00 -04:00
  • b6553be1bc [Misc] Slight improvement of the BNB (#19418) (tags: v0.9.1rc2, v0.9.1) Jee Jee Li 2025-06-10 21:51:49 +08:00
  • 64a9af5afa Simplify ep kernels installation (#19412) youkaichao 2025-06-10 20:06:08 +08:00
  • e4248849ec [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (#19411) Li, Jiang 2025-06-10 20:02:40 +08:00
  • 467bef18a3 [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope (#19134) Rachel Guo 2025-06-10 01:48:51 -07:00
  • 5f1ac1e1d1 Revert "[v1] Add fp32 support to v1 engine through flex attn" (#19404) Isotr0py 2025-06-10 16:30:20 +08:00
  • 9368cc90b2 Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (#17930) Louie Tsai 2025-06-09 23:22:05 -07:00
  • 32b3946bb4 Add clear documentation around the impact of debugging flag (#19369) Anna Pendleton 2025-06-09 23:16:09 -07:00
  • 6b1391ca7e [Misc] refactor neuron_multimodal and profiling (#19397) Reid 2025-06-10 14:12:42 +08:00
  • a3f66e75d1 Add security warning to bug report template (#19365) Russell Bryant 2025-06-10 02:06:36 -04:00
  • 319cb1e351 [Core] Batch multi modal input using pinned memory (#19169) Lukas Geiger 2025-06-10 07:44:59 +02:00
  • 1efef71645 [Bugfix] Fix modelscope token passed in (#19389) Li Wang 2025-06-10 13:39:37 +08:00
  • 646d62f636 [Core] Use tuple for kv cache group block ids (#19175) Nick Hill 2025-06-09 22:01:17 -07:00
  • 6cd4ae8acd [Frontend] Add tqdm_leave_pbar to control progress bar visibility (#19357) Reid 2025-06-10 12:55:09 +08:00
  • c016047ed7 Fix docs/mkdocs/hooks/remove_announcement.py (#19382) Harry Mellor 2025-06-10 05:36:54 +01:00
  • 9af6d22e4c Use xla flag to improve the quantized model performance (#19303) XiongfeiWei 2025-06-09 18:28:45 -07:00
  • 4589b94032 [Bugfix] Fix benchmark_moe.py (#19016) Tianyu Guo 2025-06-10 09:04:36 +08:00
  • cc867be19c [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (#19312) Ye (Charlotte) Qi 2025-06-09 17:40:01 -07:00
  • 3a7cd627a8 [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (#19383) (tag: v0.9.1rc1) Siyuan Liu 2025-06-09 16:41:51 -07:00
  • 8058c91108 [HOT-FIX] Add kv_sharing_target_layer_name argument to cutlass_mla backend (#19374) Pavani Majety 2025-06-09 16:00:07 -07:00
  • 7d44c469fe [TPU]Fix KV cache sharing tests (#19371) Siyuan Liu 2025-06-09 15:38:15 -07:00
  • 31f58be96a [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (#18472) liusiqian-tal 2025-06-10 05:41:21 +08:00
  • ebb2f383b8 [Quantization] Bump compressed-tensors version (#19295) Kyle Sayers 2025-06-09 17:33:15 -04:00
  • c1c7dbbeeb [Bugfix][Core] Prevent token lengths exceeding max_model_len in V0 (#19348) 22quinn 2025-06-09 08:01:29 -07:00
  • 5cf2daea9a [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (#19298) Varun Sundar Rabindranath 2025-06-09 10:50:39 -04:00
  • b8089195b4 [v1] Add fp32 support to v1 engine through flex attn (#19319) Isotr0py 2025-06-09 22:10:44 +08:00
  • 770e5dcdb8 [full_graph] Fix query_start_loc padding (#19321) Yinghai Lu 2025-06-09 06:32:56 -07:00
  • c57c9415b1 [Docs] Fix a bullet list in usage/security.md (#19358) Michael Yao 2025-06-09 21:28:51 +08:00
  • 01810f9236 [CI] Introduce rules for llama auto-label (#19323) Lu Fang 2025-06-09 20:05:42 +08:00
  • 59abbd84f9 [Fix] Allow kernel compilation for CUDA capability 8.7 (#19328) Conroy Cheers 2025-06-09 19:57:23 +10:00
  • 95a6568b5c [CI/Build] Fix LoRA test (#19350) Jee Jee Li 2025-06-09 17:52:10 +08:00
  • 0eca5eacd0 [Doc] Fix description in the Automatic Prefix Caching design doc (#19333) Se7en 2025-06-09 17:30:02 +08:00
  • 12e5829221 [doc] improve ci doc (#19307) Reid 2025-06-09 15:26:12 +08:00
  • 3a4d417707 [Misc] Cleanup compilation tests (#19343) Richard Zou 2025-06-09 03:05:44 -04:00
  • 8335667c22 [Frontend] Remove unreachable code from llm.py (#19288) Kseniya Parkhamchuk 2025-06-08 21:22:10 -05:00
  • e1c4380d4c [Misc] Add documentation update reminder to PR template (#19289) Isotr0py 2025-06-09 10:20:53 +08:00
  • e31ae3de36 [Deprecation] Remove inputs arg fallback in Engine classes (#18799) Cyrus Leung 2025-06-09 10:19:56 +08:00
  • 2ffb9b6e07 [Bugfix] model_max_length should consider max_model_len in tokenizer_config (#19201) wang.yuqi 2025-06-08 22:17:53 +08:00
  • cda10fa3e2 [Multi Modal] Add an env var for message queue max chunk bytes (#19242) jennyyyyzhen 2025-06-08 06:39:12 -07:00
  • c123bc33f9 [Quantization] Add compressed-tensors NVFP4 support (#18312) Dipika Sikka 2025-06-08 06:05:55 -07:00
  • b9a1791e2c [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (#19082) Akash kaothalkar 2025-06-08 14:47:14 +05:30
  • 989dcee981 Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (#19315) Xu Wenqing 2025-06-08 16:07:02 +08:00
  • 3d64d366e0 [Misc] Change tests/compile to use VLLM_V1 by default (#19302) Richard Zou 2025-06-08 04:06:48 -04:00
  • eaa2e51088 [Bugfix] Re-enable use_cudagraph in vLLM v1 (#19299) Richard Zou 2025-06-07 20:56:12 -04:00
  • d77f7fb871 [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (#19283) Chauncey 2025-06-08 08:16:31 +08:00
  • 2d8476e465 [BugFix][V1] Fix memory profiling bug (#18974) Luka Govedič 2025-06-07 13:34:51 -04:00
  • 88be823d57 [AMD] Update compatible packaging version (#19309) pramenku 2025-06-07 18:25:09 +05:30
  • 4e4f63ad45 [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (#19311) Lifans 2025-06-07 03:25:38 -07:00
  • d2f0e7e615 [CI/Build] Improve Llama GGUF test robustness (#19287) Isotr0py 2025-06-07 17:23:28 +08:00
  • 122cdca5f6 [Misc] refactor context extension (#19246) Reid 2025-06-07 13:13:21 +08:00
  • cf02f9b283 Add FlexAttention to V1 (#16078) Driss Guessous 2025-06-07 00:58:55 -04:00
  • c4296b1a27 [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (#19253) Aaruni Aggarwal 2025-06-07 09:22:52 +05:30
  • 66c508b137 [TPU][Test] Add script to run benchmark on TPU for buildkite (#19039) QiliangCui 2025-06-06 20:10:24 -07:00
  • 84166fee97 [Kernel] Integrate CUTLASS MoE kernel with PPLX (#18762) ElizaWszola 2025-06-07 03:26:11 +02:00
  • 6e0cd10f72 [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (#19269) Lu Fang 2025-06-07 09:19:09 +08:00
  • e010688f50 [Build][ROCm] Update Dockerfile.rocm (#19296) Alexei-V-Ivanov-AMD 2025-06-06 18:35:16 -05:00
  • 441b65d8c7 [Misc][Tools][Benchmark] Fix and improve auto tune script (#19163) Chenyaaang 2025-06-06 16:31:19 -07:00
  • 46ecc57973 [BugFix] Fix tpu_model_runner block_id concatenation (#19228) Nick Hill 2025-06-06 16:28:17 -07:00