Commit Graph

  • 6fad29b11b Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper (#23385) Copilot 2025-08-25 19:34:15 -07:00
  • 6fd45e7b8a [CI/Build] Use vLLM client's user agent to fetch images (#23561) Cyrus Leung 2025-08-26 10:34:12 +08:00
  • 56dcf4e7e9 [Bug] Fix DeepGEMM Env Control (#23591) Wentao Ye 2025-08-25 21:41:21 -04:00
  • ae067888d6 Update Flashinfer to 0.2.14.post1 (#23537) weiliang 2025-08-26 09:30:44 +08:00
  • 906e461ed6 [CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests (#23568) Michael Goin 2025-08-25 21:29:00 -04:00
  • 2a97ffc33d [Misc] Add release note draft to PR template (#23598) Simon Mo 2025-08-25 16:44:51 -07:00
  • efc88cf64a [Misc] Simplify FlashInfer attention metadata (#23585) Woosuk Kwon 2025-08-25 15:42:29 -07:00
  • 7b6a837275 [Docs] Update Documentation of Cohere Command-A Models (#23584) Terrence Zhao 2025-08-25 17:53:52 -04:00
  • c34c82b7fe [TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. (#23574) Pate Motter 2025-08-25 14:29:16 -07:00
  • 8a044754bd [XPU] Delay BF16 check to worker init for spawn compatibility (#22979) Chaojun Zhang 2025-08-26 04:09:26 +08:00
  • 9188ae7cb5 [Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector (#23403) Zhonghua Deng 2025-08-26 03:57:08 +08:00
  • 8a3cd90af5 [Kernel] Add fused grouped_topk kernel for MoE (#23274) Xin Yang 2025-08-25 11:47:52 -07:00
  • 2a167b2eeb [test][RL] Add sleep level 2 test and fix reload with sleep mode (#23521) 22quinn 2025-08-25 09:25:52 -07:00
  • 0ff902f3b4 [Refactor] Refactor persistent buffers with CpuGpuBuffer (#23515) Woosuk Kwon 2025-08-25 08:44:48 -07:00
  • a9082a4d14 [Bugfix] Fix Qwen3 MoE GPTQ inference (#23490) Isotr0py 2025-08-25 21:40:20 +08:00
  • e0329ed4b4 Updates to Flex + VLLm integration (#21416) Driss Guessous 2025-08-25 06:32:42 -07:00
  • 6879cd80ae [Refactor] Pass tokenizer explicitly instead of binding to prompt update (#23542) Cyrus Leung 2025-08-25 21:31:57 +08:00
  • e269be2ba2 [Doc] Add caution for API server scale-out (#23550) Cyrus Leung 2025-08-25 21:14:15 +08:00
  • 5c4b6e66fe [Attention] Unify mamba and attention backend selection (#23171) Ayush Satyam 2025-08-25 14:39:36 +05:30
  • d0a4a3f645 [misc] add shanghai meetup (#23535) youkaichao 2025-08-25 17:00:03 +08:00
  • ebafb0936d [Bugfix] Allow dynamic number of patches for llava_onevision (#23525) Cyrus Leung 2025-08-25 16:34:54 +08:00
  • 0cb7b065c3 Feature/benchmark/random mm data/images (#23119) Breno Baldas Skuk 2025-08-25 10:28:35 +02:00
  • 2da02dd0d8 [Fix] DeepSeek V3.1 tool parser error message (#23492) ZiTian Zhao 2025-08-25 15:56:39 +08:00
  • d765cf01fe [Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests (#22711) Chenguang Zheng 2025-08-25 15:41:17 +08:00
  • 712d0f88d8 [Refactor] Dynamic target and content for prompt updates (#23411) Cyrus Leung 2025-08-25 14:39:58 +08:00
  • 49ab23b3cc [gpt-oss] use reasoning channel for reasoning text in serving_chat (#22920) Yu Guo 2025-08-24 23:29:34 -07:00
  • c9abb10489 [Bugfix] Fix Dense module loading for sentence-transformers embedding models (simplified V2) (#23408) LIYIFAN_liyifan 2025-08-24 22:39:24 -07:00
  • 787cdb3829 Migrate DonutImagePixelInputs to TensorSchema (#23509) Benji Beck 2025-08-24 22:02:15 -07:00
  • a5203d04df Migrate skyworkr1v inputs to TensorSchema (#23499) Benji Beck 2025-08-24 21:43:21 -07:00
  • 99f8094400 Migrate tarsier inputs to TensorSchema (#23500) Benji Beck 2025-08-24 21:42:36 -07:00
  • 170e8ea9ea [Misc] Unified linear print info (#23516) Jee Jee Li 2025-08-25 11:13:51 +08:00
  • a71e4765cc [Bugfix] Fix Qwen2.5-VL quantized model weights loading (#23512) zifeitong 2025-08-24 19:40:22 -07:00
  • 39971db3aa Frontend: Adding LM Format Enforcer support to V1 engine (#22564) Noam Gat 2025-08-25 05:31:22 +03:00
  • 504d914314 [Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 (#23504) Ming Yang 2025-08-24 18:06:35 -07:00
  • 47455c424f [Doc: ]fix various typos in multiple files (#23487) Didier Durand 2025-08-25 02:04:04 +02:00
  • c7fc6b1354 fix incompatibililty with non cuda platform for nvfp4 (#23478) Lucia Fang 2025-08-24 15:35:41 -07:00
  • ad78868450 [Misc] Remove unused slot_mapping buffer (#23502) Woosuk Kwon 2025-08-24 14:03:36 -07:00
  • e2db1164a1 [Model] Enable BLOOM on V1 (#23488) Cyrus Leung 2025-08-24 21:30:47 +08:00
  • 416f05929a [New Model]Donut model (#23229) 汪志鹏 2025-08-24 20:52:24 +08:00
  • 5e021b4981 (Misc): add missing test for zero truncation size. (#23457) TeeKen Lau 2025-08-24 20:12:47 +10:00
  • 1b9b16649c [Misc] update dict parse to EPLBConfig from json dumps to dict unpacking (#23305) rongfu.leng 2025-08-24 16:06:34 +08:00
  • e76e233540 [kernel] Support W4A8 on Hopper (#23198) czhu-cohere 2025-08-24 02:18:04 -04:00
  • a75277285b Migrate Paligemma inputs to TensorSchema (#23470) Benji Beck 2025-08-23 21:56:56 -07:00
  • 9dc30b7068 [Bugfix] Add strong reference to CUDA pluggable allocator callbacks (#23477) 22quinn 2025-08-23 21:56:17 -07:00
  • 053278a5dc Migrate Pixtral inputs to TensorSchema (#23472) Benji Beck 2025-08-23 21:55:53 -07:00
  • c55c028998 [gpt-oss] Streaming Output for Python Tool (#23409) Jiangyun Zhu 2025-08-24 12:42:38 +08:00
  • 65197a5fb3 [Misc] Modify CacheConfig import (#23459) Jee Jee Li 2025-08-23 14:05:27 +08:00
  • b8f17f5d98 Support DeepSeek-V3.1 tool call (#23454) Xu Wenqing 2025-08-23 13:50:16 +08:00
  • d9a55204ba fix(tests): Correct unreachable assertion in truncation test (#23425) Aziz 2025-08-23 07:23:54 +02:00
  • b4e9fd811f Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion (#20000)" (#23396) Cyrus Leung 2025-08-23 12:16:48 +08:00
  • 308fa287a8 Add glm4.5v tp2,4 fp8 config on H100_80GB (#23443) Chenxi Yang 2025-08-22 19:54:19 -07:00
  • fa78de9dc3 Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs (#22527) Daifeng Li 2025-08-23 10:53:21 +08:00
  • f6818a92cb [UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh (#23360) Michael Goin 2025-08-22 22:52:50 -04:00
  • 23c939fd30 [Model] Support DP for ViT on MiniCPM-V-4 (#23327) WeiQing Chen 2025-08-23 10:14:41 +08:00
  • add1adfec7 [BugFix] Fix MinPLogitsProcessor.update_states() (#23401) Nick Hill 2025-08-22 17:22:11 -07:00
  • c80c53a30f [BugFix] Fix batch updates for pooling models (#23398) Nick Hill 2025-08-22 17:20:41 -07:00
  • 24d0c9e6ed [NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel (#22703) elvischenv 2025-08-23 06:09:05 +08:00
  • cc7ae5e7ca [BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ not being called when it should when using quantized FP8 model (#22281) rasmith 2025-08-22 16:47:57 -05:00
  • 0313cf854d [PERF] PyTorch Symmetric Memory All-Reduce (#20759) Ilya Markov 2025-08-22 23:39:08 +02:00
  • 0483fabc74 [CI/Build] add EP dependencies to docker (#21976) Zhewen Li 2025-08-22 13:34:40 -07:00
  • da65bec309 add an env var for path to pre-downloaded flashinfer cubin files (#22675) Shiyan Deng 2025-08-22 12:25:45 -07:00
  • 4645024d3a [Quantization] Allow GGUF quantization to skip unquantized layer (#23188) Isotr0py 2025-08-23 03:04:22 +08:00
  • cd7a3df26f [Bugfix] Fix broken Florence-2 model (#23426) Isotr0py 2025-08-23 01:50:52 +08:00
  • 32d2b4064f [Model] Add Ovis2.5 PP support (#23405) Isotr0py 2025-08-23 01:46:34 +08:00
  • 22cf679aad [Doc]: fix various typos in multiple files (#23179) Didier Durand 2025-08-22 19:38:46 +02:00
  • b6d7d34fc6 Add unit tests for batched guided and non-guided requests (#23389) Yong Hoon Shin 2025-08-22 10:31:24 -07:00
  • 341923b982 fix(tests): Ensure reliable CUDA cache clearing in MoE test (#23416) Aziz 2025-08-22 19:20:59 +02:00
  • 424fb7a5d2 [BugFix] Fix the issue where image embeddings were incorrectly split.… (#23366) bppps 2025-08-23 00:56:46 +08:00
  • 88491c1b6b [Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support (#23337) PapaGoose 2025-08-22 19:39:19 +03:00
  • 613a23b57f [Bugfix]: Installing dev environment due to pydantic incompatible version (#23353) Martin Hickey 2025-08-22 17:22:29 +01:00
  • 51a215300b [Fix] Bump triton version in rocm-build requirements (#21630) Burkhard Ringlein 2025-08-22 17:13:39 +02:00
  • ebe14621e3 [Bug fix] Dynamically setting the backend variable for genai_perf_tests in the run-nightly-benchmark script (#23375) Naman Lalit 2025-08-22 08:12:28 -07:00
  • 325aa3dee9 [Misc] local import code clean (#23420) Ning Xie 2025-08-22 22:01:35 +08:00
  • a073be6d87 [Doc] Update the doc for log probs + prefix caching (#23399) Chen Zhang 2025-08-22 06:20:39 -07:00
  • 695e7adcd2 [misc] Remove outdate comment about runai_model_streamer (#23421) 杨朱 · Kiki 2025-08-22 21:08:53 +08:00
  • 281710ef9a [Attention] Allow V1 flash_attn to support cross-attention (#23297) Russell Bryant 2025-08-22 08:10:16 -04:00
  • 808d2e9aa0 [Misc] Move M-RoPE init logic to _init_mrope_positions (#23422) Woosuk Kwon 2025-08-22 03:07:22 -07:00
  • 285178b3b8 [V0 Deprecation] Remove V0 LoRA test (#23418) Jee Jee Li 2025-08-22 17:56:51 +08:00
  • 88016c372a [Bugfix] Fix pooling models on CPU backend (#23392) Li, Jiang 2025-08-22 17:47:17 +08:00
  • 998720859c Migrate MiniCPMOAudioInputs to TensorSchema (#21847) Benji Beck 2025-08-22 01:43:29 -07:00
  • 0ba1b54ac6 [gpt-oss] add input/output usage in responses api when harmony context is leveraged (#22667) Guillaume Calmettes 2025-08-22 10:32:24 +02:00
  • 53415653ff [P/D][Nixl] Make kv cache register compatible with hybrid memory allocator (#23079) Flora Feng 2025-08-21 22:30:48 -07:00
  • 17373dcd93 [Attention] Refactor AttentionMetadata Preparation for Encoder-only Models (#23154) Chen Zhang 2025-08-21 22:05:59 -07:00
  • 5964069367 [New Model] Add Seed-Oss model (#23241) Bin Jia 2025-08-22 12:58:10 +08:00
  • de9c085e17 [Misc] Add gemma3 chat template with pythonic-style function calling (#17149) Philip Chung 2025-08-21 21:06:50 -07:00
  • 111692bb8c [CI] Add end-to-end V1 min_tokens test coverage (#22495) Arjun Reddy 2025-08-21 23:04:07 -05:00
  • 394591e343 [Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement (#23351) Wentao Ye 2025-08-22 00:01:08 -04:00
  • 3ac849665d [CI/Build] Skip Idefics3 and SmolVLM generation test again (#23356) Isotr0py 2025-08-22 11:39:46 +08:00
  • 0b9cc56fac Migrate MllamaImagePixelInputs to TensorSchema (#22020) Benji Beck 2025-08-21 20:28:49 -07:00
  • 8896eb72eb [Deprecation] Remove prompt_token_ids arg fallback in LLM.generate and LLM.embed (#18800) Cyrus Leung 2025-08-22 10:56:57 +08:00
  • 19fe1a0510 [Kernel] Add FP8 support with FlashMLA backend (#22668) Matthew Bonanni 2025-08-21 22:26:32 -04:00
  • 480bdf5a7b [Core] Support custom executor qualname (#23314) 22quinn 2025-08-21 18:40:54 -07:00
  • 5368f76855 [Feature][Responses API] Support logprobs(non-stream) (#23319) Kebe 2025-08-22 07:09:16 +08:00
  • 8ef6b8a38c Always use cache mounts when installing vllm to avoid populating pip cache in the image. Also remove apt cache. (#23270) tvalentyn 2025-08-22 00:01:03 +02:00
  • 3bbe11cc13 [Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm (#23265) Michael Goin 2025-08-21 17:56:15 -04:00
  • c5041f899f [CI] improve pr comments bot (#23380) Simon Mo 2025-08-21 14:49:03 -07:00
  • 8b5fe6eb51 [CI] Clean up actions: remove helm, publish workflows and improve pr … (#23377) Simon Mo 2025-08-21 14:29:04 -07:00
  • 800349c2a5 [Structured Outputs] Refactor bitmask construction into get_grammar_bitmask (#23361) Woosuk Kwon 2025-08-21 13:53:33 -07:00
  • 044931f97b Make sure that vectorize_with_alignment produced vectorized global loads (#23182) Elvir Crnčević 2025-08-21 22:06:54 +02:00
  • 1d353b6352 [Core] Always use tensor cores for Flashinfer Decode Wrapper (#23214) Pavani Majety 2025-08-21 13:02:11 -07:00