Commit Graph

  • 2dec7c1a5d [Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported (#21420) elvischenv 2025-07-23 11:34:50 +08:00
  • 08d2bd78da [BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update (#21414) Chendi.Xue 2025-07-22 22:33:57 -05:00
  • 4f76a05f4f [BugFix] Update python to python3 calls for image; fix prefix & input calculations. (#21391) ericehanley 2025-07-22 22:33:00 -05:00
  • f154bb9ff0 Simplify weight loading in Transformers backend (#21382) Harry Mellor 2025-07-23 04:29:43 +01:00
  • 3ec7170ff1 [Bugfix][ROCm][Build] Fix build regression on ROCm (#21393) Gregory Shtrasberg 2025-07-22 23:27:41 -04:00
  • c401c64b4c [CI/Build] Fix model executor tests (#21387) Cyrus Leung 2025-07-23 11:25:37 +08:00
  • b77c7d327f [BugFix] Fix ray import error mem cleanup bug (#21381) Joe Runde 2025-07-22 17:19:55 -06:00
  • 35bc8bd5fb [Misc] Copy HF_TOKEN env var to Ray workers (#21406) Rui Qiao 2025-07-22 16:18:42 -07:00
  • 4594fc3b28 [Model] Add Qwen3CoderToolParser (#21396) Yiheng Xu 2025-07-22 15:05:57 -07:00
  • ae268b6326 Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num (#21325) Xin Li 2025-07-22 15:42:31 -04:00
  • 35366ae57c [CI/Build] Fix test failure due to updated model repo (#21375) Cyrus Leung 2025-07-22 23:39:35 +08:00
  • 2226d5bd85 [Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers (#21353) Aritra Roy Gosthipaty 2025-07-22 20:57:28 +05:30
  • 44554a0068 Add tokenization_kwargs to encode for embedding model truncation (#21033) Wang Yijun 2025-07-22 23:24:00 +08:00
  • 226b452a20 Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" (#21384) Wentao Ye 2025-07-22 11:22:10 -04:00
  • f38ee34a0a [feat] Enable mm caching for transformers backend (#21358) Raushan Turganbay 2025-07-22 17:18:46 +02:00
  • b194557a6c Adds parallel model weight loading for runai_streamer (#21330) Benjamin Bartels 2025-07-22 16:15:53 +01:00
  • 774d0c014b [Perf] Cuda Kernel for Per Token Group Quant (#21083) Wentao Ye 2025-07-22 10:27:15 -04:00
  • 2c8db17cfd [feat]: add SM100 support for cutlass FP8 groupGEMM (#20447) Duncan Moss 2025-07-22 07:27:12 -07:00
  • 4fb56914c5 [perf] Add fused MLA QKV + strided layernorm (#21116) Mickaël Seznec 2025-07-22 16:07:44 +02:00
  • 0df4d9b06b [Misc] unify variable for LLM instance v2 (#21356) Ning Xie 2025-07-22 21:32:36 +08:00
  • ed25054577 [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool (#21222) Jialin Ouyang 2025-07-22 06:17:47 -07:00
  • 10904e6d75 [benchmark] Port benchmark request sent optimization to benchmark_serving (#21209) Jialin Ouyang 2025-07-22 05:28:00 -07:00
  • a32237665d [Core] Optimize update checks in LogitsProcessor (#21245) Jialin Ouyang 2025-07-22 05:27:18 -07:00
  • bc8a8ce5ec [Misc] Remove deprecated args in v0.10 (#21349) Kebe 2025-07-22 20:26:39 +08:00
  • 32142b3c62 [Bugfix] Fix eviction cached blocked logic (#21357) Simon Mo 2025-07-22 01:18:40 -07:00
  • 82b8027be6 Add arcee model (#21296) Raghav Ravishankar 2025-07-22 13:27:43 +05:30
  • 3779eb8c81 [Feature][eplb] add verify ep or tp or dp (#21102) rongfu.leng 2025-07-22 14:41:14 +08:00
  • 9e23ad9655 Update fp4 quantize API (#21327) Shu Wang 2025-07-22 01:40:21 -05:00
  • e69a92a1ce [Bug] DeepGemm: Fix Cuda Init Error (#21312) Wentao Ye 2025-07-22 02:36:18 -04:00
  • 8425f785ad [Misc] DeepEPHighThroughtput - Enable Inductor pass (#21311) Varun Sundar Rabindranath 2025-07-22 12:05:45 +05:30
  • c17231e827 Fix kv_cache_dtype handling for out-of-tree HPU plugin (#21302) Konrad Zawora 2025-07-22 08:35:14 +02:00
  • 6e5b5ca580 [Refactor] Fix Compile Warning #1444-D (#21208) Wentao Ye 2025-07-22 02:33:51 -04:00
  • 488d8a986a [V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible (#21300) Thomas Parnell 2025-07-22 08:31:18 +02:00
  • af376ca19d [Core] Minimize number of dict lookup in _maybe_evict_cached_block (#21281) Jialin Ouyang 2025-07-21 22:37:34 -07:00
  • e7b2042681 Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) (#21334) Ming Yang 2025-07-21 21:49:01 -07:00
  • 90f1e55421 [Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU (#21338) Ratnam Parikh 2025-07-21 21:48:27 -07:00
  • 5e70dcd6e6 [Doc] Fix CPU doc format (#21316) Li, Jiang 2025-07-22 12:47:49 +08:00
  • 25d585ab7b [XPU] Enable external_launcher to serve as an executor via torchrun (#21021) Chaojun Zhang 2025-07-22 12:47:35 +08:00
  • 8d0a01a5f2 [v1][sampler] Inplace logprobs comparison to get the token rank (#21283) Lu Fang 2025-07-21 13:47:47 -07:00
  • 0ec82edda5 [perf] Speed up align sum kernels (#21079) Himanshu Jaju 2025-07-21 19:19:23 +01:00
  • 005ae9be6c Fix bad lm-eval fork (#21318) Michael Goin 2025-07-21 13:47:51 -04:00
  • 29d1ffc5b4 [DP] Fix Prometheus Logging (#21257) Robert Shaw 2025-07-21 12:11:35 -04:00
  • 304dce7ec0 [Attention] Clean up iRoPE in V1 (#21188) Lucas Wilkinson 2025-07-21 12:10:30 -04:00
  • 6ece16c4fe [Misc] Add dummy maverick test (#21199) Ming Yang 2025-07-21 09:08:09 -07:00
  • a0e827e07c [BugFix] make utils.current_stream thread-safety (#21252) (#21253) simpx 2025-07-22 00:07:36 +08:00
  • a15a50fc17 [CPU] Enable shared-memory based pipeline parallel for CPU backend (#21289) Li, Jiang 2025-07-22 00:07:08 +08:00
  • 6dda13c86b [Misc] Add sliding window to flashinfer test (#21282) Woosuk Kwon 2025-07-21 08:37:49 -07:00
  • 6b46c4b653 Add Nvidia ModelOpt config adaptation (#19815) Zhiyu 2025-07-21 07:02:58 -07:00
  • d97841078b [Misc] unify variable for LLM instance (#20996) Ning Xie 2025-07-21 19:18:33 +08:00
  • e6b90a2805 [Docs] Make tables more space efficient in supported_models.md (#21291) Harry Mellor 2025-07-21 10:25:02 +01:00
  • be54a951a3 [Docs] Fix hardcoded links in docs (#21287) Harry Mellor 2025-07-21 10:23:57 +01:00
  • 042af0c8d3 [Model][1/N] Support multiple poolers at model level (#21227) Cyrus Leung 2025-07-21 17:22:21 +08:00
  • 378d33c392 [Bugfix] Fix missing placeholder in logger debug (#21280) Cyrus Leung 2025-07-21 13:50:06 +08:00
  • 940af1f03a Add the instruction to run e2e validation manually before release (#21023) Huy Do 2025-07-20 22:29:18 -07:00
  • 92615d7fe8 [Docs] Add RFC Meeting to Issue Template (#21279) Simon Mo 2025-07-20 21:58:07 -07:00
  • 8188196a1c [CI] Cleanup modelscope version constraint in Dockerfile (#21243) Kay Yan 2025-07-21 11:13:02 +08:00
  • 7ba34b1241 [bugfix] fix syntax warning caused by backslash (#21251) Jiayi Yan 2025-07-21 01:12:10 +08:00
  • 9499e26e2a [Model] Support VLMs with transformers backend (#20543) Raushan Turganbay 2025-07-20 15:25:50 +02:00
  • 51ba839555 [Model] use AutoWeightsLoader for bart (#18299) Calvin Chen 2025-07-20 16:15:50 +08:00
  • d1fb65bde3 Enable v1 metrics tests (#20953) (tag: v0.10.0rc1) Seiji Eicher 2025-07-19 20:22:02 -07:00
  • 3a1d8940ae [TPU] support fp8 kv cache quantization (#19292) Chengji Yao 2025-07-19 20:01:00 -07:00
  • 2b504eb770 [Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. (#21233) Thomas Parnell 2025-07-20 01:09:58 +02:00
  • 10eb24cc91 GLM-4 Update (#20736) Yuxuan Zhang 2025-07-20 06:40:31 +08:00
  • 2e8cbb58f3 [BugFix] Fix full cuda graph slot_mapping (#21228) fhl2000 2025-07-20 05:13:18 +08:00
  • 752c6ade2e [V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217) Woosuk Kwon 2025-07-19 13:53:17 -07:00
  • 881e3cbe3b [V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers (#21194) Thomas Parnell 2025-07-19 21:27:21 +02:00
  • 9f414a12ad [BugFix] Make PD work with Ray (#21072) kourosh hakhamaneshi 2025-07-19 08:46:50 -07:00
  • 6a971ed692 [Docs] Update the link to the 'Prometheus/Grafana' example (#21225) Jiayi Yan 2025-07-19 21:58:07 +08:00
  • da6579bf41 [CI/CD][bugfix]fix: error argument to loads has incompatible type (#21223) Sungjae Lee 2025-07-19 21:16:48 +09:00
  • c81259d33a Fix/remove some broken model executor tests (#21224) Rabi Mishra 2025-07-19 17:45:07 +05:30
  • e3a0e43d7f [bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code (#21032) Li, Jiang 2025-07-19 20:13:55 +08:00
  • b3d82108e7 [Bugfix][Frontend] Fix openai CLI arg middleware (#21220) 22quinn 2025-07-19 02:40:38 -07:00
  • 6d0734c562 [NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency (#20645) Kaixi Hou 2025-07-19 02:33:01 -07:00
  • 7d94577138 Add torch golden impl for moe_align_block_size kernel test (#20653) shixianc 2025-07-19 02:32:36 -07:00
  • 59f935300c [BugFix] Fix potential cuda-graph IMA (#21196) Lucas Wilkinson 2025-07-19 05:18:47 -04:00
  • 18e519ec86 [Bugfix] Fix ndarray video color from VideoAsset (#21064) Isotr0py 2025-07-19 17:17:16 +08:00
  • 1eaff27815 [V0 deprecation] Remove long context LoRA (#21169) Jee Jee Li 2025-07-19 17:15:41 +08:00
  • cf8cc32674 Fix a couple of Voxtral tests (#21218) Huy Do 2025-07-19 02:13:41 -07:00
  • 3a2cb2649d [Misc][Tools][Benchmark] Add readme file for auto_tune script (#20779) Chenyaaang 2025-07-19 02:06:59 -07:00
  • 3e04107d97 [Model] EXAONE 4.0 model support (#21060) 김종곤 2025-07-19 15:25:44 +09:00
  • 37bd8d6e4c [Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 (#21187) Wentao Ye 2025-07-19 02:25:22 -04:00
  • 468e2400fe [BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope (#21200) Lucas Wilkinson 2025-07-19 02:18:48 -04:00
  • dcc6cfb991 [Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel (#21193) Varun Sundar Rabindranath 2025-07-19 11:39:51 +05:30
  • dd572c0ab3 [V0 Deprecation] Remove V0 Spec Decode workers (#21152) Woosuk Kwon 2025-07-18 21:47:50 -07:00
  • 9ffe905a41 [Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 (#21183) Varun Sundar Rabindranath 2025-07-19 09:45:03 +05:30
  • 9a9fda1423 [Core] Support Local Chunked Attention for Hybrid KV Cache (#19351) Lucia Fang 2025-07-19 11:48:38 +08:00
  • 466e878f2a [Quantization] Enable BNB support for more MoE models (#21100) Jee Jee Li 2025-07-19 08:52:02 +08:00
  • 217937221b Elastic Expert Parallel Initial Support (#20775) Rui Qiao 2025-07-18 17:46:09 -07:00
  • 5782581acf [Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) (#21077) hax0r31337 2025-07-19 00:40:18 +02:00
  • 0f199f197b [Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue (#21005) JialinOuyang-Meta 2025-07-18 12:34:40 -07:00
  • b2eb2b5ad7 [Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 (#19346) Richard Zou 2025-07-18 14:10:21 -04:00
  • 21274ab476 [CI] Update CODEOWNERS for vllm/compilation (#21185) Richard Zou 2025-07-18 09:51:12 -04:00
  • ed8cbfedf8 Let GraniteMoeAttention use YaRN (#21174) Thomas Parnell 2025-07-18 14:52:52 +02:00
  • 45badd05d0 [Core] Set pooling params based on task and model (#21128) Cyrus Leung 2025-07-18 20:41:17 +08:00
  • 4adc66f64d [Bugfix] Allocate less memory in non-batched CUTLASS MoE (#21121) ElizaWszola 2025-07-18 12:55:52 +02:00
  • 55ad648715 [Doc] Fix typo in model name (#21178) Cyrus Leung 2025-07-18 18:55:10 +08:00
  • 5895afd780 [Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. (#20750) wang.yuqi 2025-07-18 17:10:47 +08:00
  • ca4eb82bcb [Model] Re-add the implicit conversion feature for as_seq_cls_model (#21103) wang.yuqi 2025-07-18 15:15:07 +08:00
  • ba2dfbb0c2 [Misc] Make MM embedding merge interface explicit in model runner (#21147) Roger Wang 2025-07-18 00:13:57 -07:00
  • 1bf65138f6 [benchmark] Sending request strictly follows the random intervals (#21108) Jialin Ouyang 2025-07-17 23:22:08 -07:00