Commit Graph

  • ba464e6ae2 Add ORCA endpoint load metrics support (#24905) Misha Efimov 2025-11-03 03:21:31 -05:00
  • 7f4bdadb92 [XPU]Refine Dockerfile.xpu, avoid oneccl dependency issue (#27964) Kunshang Ji 2025-11-03 15:36:59 +08:00
  • cec7c28833 [Bugfix] Padded Eagle Specdec with Chunked Prefill (#26263) Rémi Delacourt 2025-11-03 08:22:46 +01:00
  • 18961c5ea6 [Hybrid] Pass kernel block size to builders (#27753) Thomas Parnell 2025-11-03 06:48:03 +01:00
  • 470ad118b6 [Frontend] Align finish_reason when tool is called with OpenAI (#25054) Sungyoon Jeong 2025-11-03 13:21:18 +09:00
  • 1bf43ae35d [BugFix][LoRA] use adapter_id instead of id field of lora_request (#27728) Biswa Panda 2025-11-02 18:08:08 -08:00
  • 0ce743f4e1 Fix(llm): Abort orphaned requests when llm.chat() batch fails Fixes #26081 (#27420) Vensen 2025-11-03 00:24:01 +08:00
  • 6c317a656e [Misc] Provide Siglip2 chat template (#27939) Cyrus Leung 2025-11-02 21:42:38 +08:00
  • 00b31a36a2 [V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377) Asaf Joseph Gardin 2025-11-02 14:16:23 +02:00
  • 73444b7b56 Performance fix MistralTokenizer: cache special ids and tokens (#27925) Julien Denize 2025-11-02 09:48:33 +01:00
  • 853a8eb53b [Bugfix] Fix Qwen Omni audio inference (#27920) Cyrus Leung 2025-11-02 13:06:05 +08:00
  • 758ea2e980 [CI/Build] Fix flaky test_transcription_validation.py::test_basic_audio_gemma (#27924) Ben Browning 2025-11-01 23:45:02 -04:00
  • 685c99ee77 [KV offload] Offloading connector async scheduling support (#27648) Yue Zhang 2025-11-02 05:08:56 +08:00
  • 1e88fb751b Adds anthropic /v1/messages endpoint to openai api_server (#27882) Benjamin Bartels 2025-11-01 19:45:42 +00:00
  • c2ed069b32 [BugFix] Fix mixed penalties batch with async scheduling (#27910) Nick Hill 2025-11-01 10:51:24 -07:00
  • af6e19f50f [Core][TPU] Support TPU Data Parallalism (#27365) wenxindongwork 2025-11-01 11:14:44 -06:00
  • 99d69af9ec [Bugfix] Python 3.10 compatibility for Self (#27918) Cyrus Leung 2025-11-01 23:28:54 +08:00
  • d811b442d3 [Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues (#26779) Haco 2025-11-01 22:52:43 +08:00
  • 30a14b034f [V0 deprecation] Remove VLLM_USE_V1 usage in platform and v1 module (#27798) wangxiyuan 2025-11-01 18:17:45 +08:00
  • 799ce45cc1 [Docs] Mock all imports for docs (#27873) Harry Mellor 2025-11-01 10:02:23 +00:00
  • 2c0c7c39bd feat(benchmarks): support HF model names in multi-turn benchmark (#27850) ai-jz 2025-11-01 01:04:52 -07:00
  • e675118849 [Add] cmdline argument parsing for KV cache offloading modules (#27621) Yihua Cheng 2025-11-01 00:17:07 -07:00
  • e2347dbf58 [Bugfix] [Model] Missing MRoPE function definition from KeyeForConditionalGeneration (#27895) TJian 2025-10-31 22:45:23 -07:00
  • 879a06579e [CI/Build] Bump transformers version (#27528) Cyrus Leung 2025-11-01 13:11:07 +08:00
  • 29de3cdee4 Adding SplitK in fused_moe_lora kernel (#27818) yugong333 2025-10-31 21:55:46 -07:00
  • 7e2729b57e [Multimodal][XPU]Enable vision attn backend for xpu platform (#27525) Yan Ma 2025-11-01 12:45:02 +08:00
  • 3a5de7d2d6 [Bugfix] Fix KDA output (#27905) Jee Jee Li 2025-11-01 11:54:36 +08:00
  • bc4486d609 [Kernel] Enable FusedMoEModularKernel support bias (#27754) Jee Jee Li 2025-11-01 10:05:12 +08:00
  • 0cdbe7b744 [Core] Async scheduling + structured outputs compatibility (#26866) Nick Hill 2025-10-31 17:35:04 -07:00
  • df334868ca [Hybrid] A simpler algorithm to find kernel_block_size (#26476) Chen Zhang 2025-10-31 14:30:28 -07:00
  • 0e0a638c3b Batch invariance doc (#27839) Bram Wasti 2025-10-31 17:22:19 -04:00
  • f29aeb5a25 Add FLASHINFER_MLA to test_mla_backends and add B200 CI run (#27663) Matthew Bonanni 2025-10-31 14:12:19 -04:00
  • 5e8862e9e0 [Feature] Pydantic validation for scheduler.py and structured_outputs.py (#26519) Vinay R Damodaran 2025-10-31 11:05:50 -07:00
  • 9e5bd3076e [Cleanup] Remove no-longer-used SpeculativeConfig.enable_chunked_prefill (#27826) Nick Hill 2025-10-31 10:57:45 -07:00
  • fc16f1c477 Flashinfer_CUTLASS_MOE fuses quantization for TP (#27223) Shu Wang 2025-10-31 10:54:29 -07:00
  • bc306fe5e9 fix incorrect type annotation in KimiMLP (#27885) ZiTian Zhao 2025-11-01 01:38:02 +08:00
  • 103a468bbf [bugfix] Missing cached item in beam search (#27874) Chenguang Zheng 2025-11-01 01:34:27 +08:00
  • 70bfbd7b16 Docs update tpu install instructions (#27824) Rob Mulla 2025-10-31 13:29:55 -04:00
  • d6517be3cd [Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node (#26338) GuanLuo 2025-11-01 01:16:00 +08:00
  • 7e06c40e63 [Bugfix] Fix broken MRoPE for GLM-4.1V/GLM-4.5V (#27860) Isotr0py 2025-11-01 01:04:51 +08:00
  • 675704ac01 [Bugfix] Allow 64-bit integer values for LoRA IDs to avoid overflow/truncation (#27876) Madeesh Kannan 2025-10-31 17:58:42 +01:00
  • 0384aa7150 [CI/Build] Add gpt-oss LoRA test (#27870) Jee Jee Li 2025-10-31 22:17:21 +08:00
  • 3857eb8725 [Perf] Decouple torch op from GDA to leverage torch.compile (#27871) Jiangyun Zhu 2025-10-31 21:35:52 +08:00
  • 933cdea440 [BugFix] Don’t compute reorder threshold when there are no attention groups (#27861) Huamin Li 2025-10-31 04:36:18 -07:00
  • 3933f18a5e [Bugfix] Avoid too small block m/n for FlexAttention kernel option (#27853) Isotr0py 2025-10-31 19:33:12 +08:00
  • e5ef4dfc11 [Kimi-Linear] Correct prefixes and add compatibility to AWQ quants (#27834) toncao 2025-10-31 16:36:37 +07:00
  • 36960501d3 [Hardware][Powerpc] Fix VLLM_CPU_OMP_THREADS_BIND="auto" low CPU utilization for Power (#27734) Akash kaothalkar 2025-10-31 13:15:26 +05:30
  • b2e65cb4a7 [benchmark] Make request IDs unique across clients by default (#27723) Seiji Eicher 2025-10-30 19:40:35 -05:00
  • 2bf0bcc1fc [CI Test] Add Scheduled Integration Test (#27765) Wentao Ye 2025-10-30 20:29:26 -04:00
  • 697f507a8e [CI/Build][Intel] Enable performance benchmarks for Intel Gaudi 3 (#26919) Jakub Sochacki 2025-10-31 00:57:22 +01:00
  • d5d2a0fe74 [Misc] Make all tool scripts executable (#27831) Matthew Bonanni 2025-10-30 19:46:02 -04:00
  • c9791f1813 [BugFix] Fix broken import in initialize_ray_cluster() (#27838) Nick Hill 2025-10-30 16:26:13 -07:00
  • e7acb20076 [Feature] Batch invariant torch.compile (#27660) Paul Zhang 2025-10-30 16:11:29 -04:00
  • 4b68c4a55b [Core][Perf] Only invoke save_new_computed_blocks when computed blocks are not empty (#27799) Jialin Ouyang 2025-10-30 12:47:30 -07:00
  • a8141fa649 [Refactor] Remove VLLM_DEEPEP_LOW_LATENCY_ALLOW_NVLINK (#27750) Wentao Ye 2025-10-30 15:32:39 -04:00
  • 4917002523 [Fix] Skip record_sleep_state logic in PrometheusStatsLogger if not in dev mode (#27789) Sumanth R Hegde 2025-10-30 12:26:27 -07:00
  • a2981c4272 [EP/DP][API Server] Enable DP-aware routing in OpenAI API requests (#24945) cong-meta 2025-10-30 12:10:16 -07:00
  • 4574d48bab [Core][Bookkeeping] Update cu_num_accepted_tokens for all req_index (#27629) Jialin Ouyang 2025-10-30 11:52:36 -07:00
  • ab98f6556f [Bugfix] Fix 2 precommit issues - (mamba_block_size, kv_cache_config) (#27811) Tyler Michael Smith 2025-10-30 14:52:18 -04:00
  • 2918c1b49c [Model] Use the same fused_moe configs for all H200 devices (#23642) (tag: v0.11.1rc5) Roger Meier 2025-10-31 01:36:56 +08:00
  • 1004205795 [MTP] Refactor mtp predictor to avoid d2h operation (#27643) Mengqing Cao 2025-10-31 01:27:39 +08:00
  • ba33e8830d Reapply "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (#27768) Huy Do 2025-10-30 10:22:30 -07:00
  • 33a0ea5f32 [Docs] add Shanghai Meetup - 2025/10 (#27545) Kebe 2025-10-31 01:33:13 +09:00
  • 60f76baa66 [Misc] Replace CUDA_VISIBLE_DEVICES in DP with torch.cuda.set_device for device selection on cuda-like devices (#27564) Ilya Markov 2025-10-30 16:41:44 +01:00
  • e5e076cad7 [BugFix] Stopgap - Flashinfer Autotuner + GPT-OSS + DP/TP (#27762) Varun Sundar Rabindranath 2025-10-30 11:24:31 -04:00
  • eebf00cb0c [Bugfix][CPU] Fix MRoPE dispatch on the CPU backend (#27800) Li, Jiang 2025-10-30 23:12:05 +08:00
  • 9956aae4ea [Model][Ouro] Support Ouro Model (#27794) Fan Yin 2025-10-30 22:34:41 +08:00
  • 0fe0140408 [KV offload] Enable CPU KV offload on CUDA alike Platforms (#27770) Zhewen Li 2025-10-30 07:10:29 -07:00
  • 4e68cc9b6a [Model] Introduce Kimi Linear to vLLM (#27809) Zhiyuan Li 2025-10-30 21:02:27 +08:00
  • 1994de99ea [CI Failure] Fix test_kv_cache_model_load_and_run (#27717) Huamin Li 2025-10-30 05:27:53 -07:00
  • 4464723f22 [Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. (#25524) wang.yuqi 2025-10-30 20:13:05 +08:00
  • 74374386e2 [Bugfix] Improve GPU validation logging in Ray fallback scenarios (#25775) Sairam Pillai 2025-10-30 17:27:59 +05:30
  • c01f6e525f [CI] Fix mypy for vllm/v1/core and vllm/v1/engine (#27108) Wentao Ye 2025-10-30 07:32:17 -04:00
  • c7d2a554ba [CI Failure] fix test_default_mm_loras (#27795) Huamin Li 2025-10-30 03:13:03 -07:00
  • af826e0820 [V0 deprecation] Remove VLLM_USE_V1 usage in config module (#27784) wangxiyuan 2025-10-30 17:42:49 +08:00
  • e806178d2a [BugFix][VL] Fix FA selection on Qwen2.5-VL (#27790) Zhewen Li 2025-10-30 00:54:44 -07:00
  • 5be1bed790 [CI/Build]Add eval config for Qwen3-235B-A22B-Instruct-2507-FP8 (#27113) Huamin Li 2025-10-30 00:50:56 -07:00
  • 31b55ffc62 use stringData in secret yaml to store huggingface token (#25685) yitingdc 2025-10-30 15:47:36 +08:00
  • ded8ada86a Add more dims for batch invariant shims (#27489) Bram Wasti 2025-10-30 01:28:45 -04:00
  • 8bff831f0a [Benchmark] Cleanup deprecated nightly benchmark and adjust the docstring for performance benchmark (#25786) Kuntai Du 2025-10-29 21:43:37 -07:00
  • b5d70751d8 [BugFix] Reordering extend logic fix (#27739) Lucas Wilkinson 2025-10-30 12:39:34 +08:00
  • b8c48c5d72 kernels/moe test pruning (#27053) Fardin Hoque 2025-10-29 21:10:34 -07:00
  • 17d055f527 [Feat] Adds runai distributed streamer (#27230) Benjamin Bartels 2025-10-30 04:09:10 +00:00
  • 2ce5c5d3d6 [BugFix] Handle unscheduled requests properly when async scheduling (#27756) Nick Hill 2025-10-29 21:04:25 -07:00
  • b5bae42f91 [XPU] Update latest IPEX 2.8 release (#27735) Kunshang Ji 2025-10-30 11:17:13 +08:00
  • d7fb10c574 [Bugfix] mamba-block-size is set for vision language model (#27773) Chen Zhang 2025-10-29 19:39:57 -07:00
  • b798e39f93 [XPU][bugfix] fix rope for llama4 and deepseek (#25145) Yan Ma 2025-10-30 09:43:13 +08:00
  • 48eb8eba58 [Temp fix] Disable torch.compile for Qwen2.5 VL's VisionBlock temporarily. (#27760) Chenheli Hua 2025-10-29 16:17:48 -07:00
  • b5d90f7400 [Bug] Fix DBO IMA issue for DeepEPHT (#27666) Wentao Ye 2025-10-29 16:28:27 -04:00
  • d4aa144343 [BugFix] Fix handling of resumed reqs in SharedStorageConnector (#27719) Nick Hill 2025-10-29 13:16:52 -07:00
  • fcb1d570bb [Bug] Fix DeepEP low latency assert self.batched_router_logits.size(-1) == full_router_logits.size(-1) Bug (#27682) Wentao Ye 2025-10-29 14:50:39 -04:00
  • accb8fab07 [KVConnector] Add metrics to Prometheus-Grafana dashboard (#26811) Nicolò Lucchesi 2025-10-29 19:44:49 +01:00
  • 5b0448104f [Bug] Raise error explicitly if using incompatible backend (#27424) Wentao Ye 2025-10-29 13:29:20 -04:00
  • f7a6682872 [CI/Build] Test torchrun with 8 cards (#27548) 22quinn 2025-10-29 10:26:06 -07:00
  • a9fe0793f2 use_aot_compile should respect VLLM_DISABLE_COMPILE_CACHE (#27698) Boyuan Feng 2025-10-29 10:08:54 -07:00
  • 7568a282b9 [FIXBUG] Qwen3VL hallucinations without Contiguous on Torch.SDPA (#27744) JartX 2025-10-29 17:55:35 +01:00
  • 1da3309ace [Core] Exposing engine sleep & wake_up state as prometheus metrics (#24176) Braulio Dumba 2025-10-29 12:32:01 -04:00
  • 5522fb274b [Chore] Optimize P2PNCCLEngine http_address (#27488) Wentao Ye 2025-10-29 12:05:09 -04:00
  • 0f95a1c3f2 [CI] Fix flaky test_two_responses_with_same_prev_id test (#27745) Nicolò Lucchesi 2025-10-29 16:10:35 +01:00
  • ded24e3e54 [ROCm][Platform] Add MI308X device id in _ROCM_DEVICE_ID_NAME_MAP (#27623) Xiake Sun 2025-10-29 22:44:03 +08:00