Commit Graph

  • 08d81f1014 [Bugfix] Fix deepep tests (#20288) Varun Sundar Rabindranath 2025-07-01 03:29:08 -04:00
  • 6cc1e7d96d [CPU] Update custom ops for the CPU backend (#20255) Li, Jiang 2025-07-01 15:25:03 +08:00
  • 9909726d2a Enable ZP Support for Machete (#20268) czhu-cohere 2025-07-01 00:12:20 -07:00
  • 22e9d42040 [Misc] add xgrammar for arm64 (#18359) Prashant Gupta 2025-07-01 00:02:20 -07:00
  • 86debab54c Fix numel() downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2 (#17082) Richard Barnes 2025-07-01 00:48:10 -06:00
  • be250bbc67 [V1] Only print cudagraph tqdm on rank 0 with is_global_first_rank (#19516) Michael Goin 2025-07-01 15:02:09 +09:00
  • 27949354fa [Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference (#18768) Alex Kogan 2025-07-01 01:44:38 -04:00
  • bd5038af07 [Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA (#15897) Ernest Wong 2025-06-30 21:44:39 -07:00
  • a2f14dc8f9 [CI][Intel Gaudi][vllm-Plugin]Add CI for hpu-plugin-v1-test (#20196) Chendi.Xue 2025-06-30 23:17:07 -05:00
  • 92ee7baaf9 [Example] add one-click runnable example for P2P NCCL XpYd (#20246) Kuntai Du 2025-06-30 21:03:55 -07:00
  • 7151f92241 [Misc] Fix spec decode example (#20296) Woosuk Kwon 2025-06-30 21:01:48 -07:00
  • e28533a16f [Bugfix] Fix include prompt in stream response when echo=true (#15233) fyuan1316 2025-07-01 09:30:14 +08:00
  • 6d42ce8315 [CLI] Improve CLI arg parsing for -O/--compilation-config (#20156) Luka Govedič 2025-06-30 21:03:13 -04:00
  • ded1fb635b [Bugfix][V1][P/D]Fix the issue of occasional garbled output for P2pNcclConnector (#20263) Zhonghua Deng 2025-07-01 07:45:14 +08:00
  • 97d9524fe9 [Refactor] Remove useless pdb comment (#20266) Wentao Ye 2025-06-30 14:15:24 -04:00
  • d8cf819a9a [Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models (#20058) Kyle Sayers 2025-06-30 13:26:49 -04:00
  • 551ef1631a [Unit Test] Add unit test for deep gemm (#20090) Wentao Ye 2025-06-30 12:26:42 -04:00
  • 2863befce3 [Optimization] Use Shared CachedRequestData Instance Across All Requests (#20232) Woosuk Kwon 2025-06-30 09:07:50 -07:00
  • 2965c99c86 [Spec Decode] Clean up spec decode example (#20240) Woosuk Kwon 2025-06-30 08:28:13 -07:00
  • 2062c0723d [Spec Decode] Refactor spec decoding into a separate function (#20238) Woosuk Kwon 2025-06-30 08:13:50 -07:00
  • 1c50e100a9 [Bugfix] fix quark ptpc (#20251) li haoyang 2025-06-30 21:24:50 +08:00
  • 3ee56e26be [Docs] Fix 1-2-3 list in v1/prefix_caching.md (#20243) Michael Yao 2025-06-30 19:20:51 +08:00
  • 8fe7fc8634 [Quantization] Improve BitsAndBytesModelLoader (#20242) Jee Jee Li 2025-06-30 18:22:09 +08:00
  • e936e401de [Bugfix] Fix processor initialization in transformers 4.53.0 (#20244) Isotr0py 2025-06-30 18:16:16 +08:00
  • f5dfa07531 [Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model (#19598) noiji 2025-06-30 18:21:56 +09:00
  • 022c58b80f [doc] Add Slack and Forum to the top navigation (#20208) Reid 2025-06-30 15:53:45 +08:00
  • 19108ef311 [Misc] Fix import (#20233) Woosuk Kwon 2025-06-29 20:34:54 -07:00
  • 5a52f389dd [BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized assert (#20202) Chendi.Xue 2025-06-29 21:46:19 -05:00
  • 65b1cbb138 [Model] support dots1 (#18254) redmoe-moutain 2025-06-30 10:34:36 +08:00
  • 6c9837a761 Fix cuda_archs_loose_intersection when handling sm_*a (#20207) Huy Do 2025-06-29 16:52:34 -07:00
  • 6f2f53a82d [Quantization] Add compressed-tensors NVFP4 MoE Support (#19990) Dipika Sikka 2025-06-30 00:05:40 +02:00
  • 7b1895e6ce [CI Fix] Try fixing eagle e2e test OOM by reducing block allocation (#20213) Michael Goin 2025-06-29 11:31:37 +09:00
  • 4d36693687 [Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx (#20187) Wentao Ye 2025-06-28 18:06:38 -04:00
  • daec9dea6e [Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution (#20137) Stan Wozniak 2025-06-28 17:16:41 +02:00
  • daceac57c7 [Frontend] Generalize v1/audio/transcriptions endpoint (#20179) Nicolò Lucchesi 2025-06-28 17:15:26 +02:00
  • 8615d9776f [CI/Build] Add new CI job to validate Hybrid Models for every PR (#20147) Thomas Parnell 2025-06-28 08:00:25 +02:00
  • 7b460c25f9 [BugFix] Fix the incorrect func name in the comments. (config.py) (#20185) Jiayi Yan 2025-06-28 13:51:16 +08:00
  • f719772281 [Bugfix] Properly reject requests with empty list guided_choice (#20195) Michael Goin 2025-06-28 14:50:52 +09:00
  • d45417b804 fix ci issue distributed 4 gpu test (#20204) Wentao Ye 2025-06-28 01:50:00 -04:00
  • a29e62ea34 Fix num_token_padding support for static per-tensor scaled_fp8_quant (#20188) Michael Goin 2025-06-28 14:48:13 +09:00
  • e53be6f00a [Misc] Add type assertion of request_id for LLMEngine.add_request (#19700) Chales Xu 2025-06-28 13:47:36 +08:00
  • c329ceca6d [CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to revision due to model changes (#20199) Michael Goin 2025-06-28 14:43:06 +09:00
  • 3c545c0c3b [CI/Build] Allow hermetic builds (#18064) Fabien Dupont 2025-06-27 18:04:39 +02:00
  • e8c3bd2cd1 [Bugfix] Fix some narrowing conversion warnings (#20141) Tyler Michael Smith 2025-06-27 12:01:28 -04:00
  • c6c983053d [Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. (#20152) bnellnm 2025-06-27 11:42:22 -04:00
  • aafabaa0d5 [Fix][torch.compile] Enable custom ops by default when Inductor off (#20102) Luka Govedič 2025-06-27 11:00:42 -04:00
  • 94a55c7681 [Fix][ROCm] Remove unused variables to fix build error on GFX11/12 (#19891) Hosang 2025-06-27 10:14:44 -04:00
  • aa0dc77ef5 [Perf] Improved perf for resolve_chat_template_content_format (#20065) Ilya Lavrenov 2025-06-27 13:16:41 +04:00
  • 4ab3ac285e [Bugfix] Fix flaky failure when getting DP ports (#20151) Michael Goin 2025-06-27 16:30:53 +09:00
  • d1c956dc0f Gemma3n (Text-only) (#20134) Robert Shaw 2025-06-27 03:16:26 -04:00
  • dec197e3e5 Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn (#20143) Chendi.Xue 2025-06-27 00:48:13 -05:00
  • 6e244ae091 [Perf][Frontend] eliminate api_key and x_request_id headers middleware overhead (#19946) Yazan Sharaya 2025-06-27 07:44:14 +03:00
  • cd4cfee689 [Model][1/N] Automatic conversion of CrossEncoding model (#20012) wang.yuqi 2025-06-27 12:10:04 +08:00
  • e110930680 [Fix] Fix gemma CI test failing on main (#20124) Thomas Parnell 2025-06-27 06:06:59 +02:00
  • 8b64c895c0 [CI] Sync test dependency with test.in for torch nightly (#19632) Yang Wang 2025-06-26 20:55:25 -07:00
  • 0740e29b66 [Feature] add quick all reduce (#19744) li haoyang 2025-06-27 11:54:24 +08:00
  • 44d2e6af63 [Bugfix] Build moe_data for both sm100 and sm90 (#20086) Michael Goin 2025-06-27 12:50:12 +09:00
  • 2d7779f888 [Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler (#20071) Ilya Markov 2025-06-27 05:50:09 +02:00
  • a57d57fa72 [Quantization] Bump to use latest compressed-tensors (#20033) Dipika Sikka 2025-06-26 23:50:06 -04:00
  • 71799fd005 [CI Failure] Fix OOM with test_oot_registration_embedding (#20144) Michael Goin 2025-06-27 12:21:04 +09:00
  • e9fd658a73 [Feature] Expert Parallelism Load Balancer (EPLB) (#18343) Bowen Wang 2025-06-26 15:30:21 -07:00
  • 07b8fae219 [Doc] correct LoRA capitalization (#20135) Kyle Yu 2025-06-26 18:22:12 -04:00
  • 562308816c [Refactor] Rename commnication utils (#20091) Wentao Ye 2025-06-26 18:19:32 -04:00
  • 04e1642e32 [TPU] add kv cache update kernel (#19928) Chengji Yao 2025-06-26 10:01:37 -07:00
  • b69781f107 [Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. (#19560) Kunshang Ji 2025-06-27 00:27:18 +08:00
  • 0bceac9810 Spam folks if config.py changes (#20131) Tyler Michael Smith 2025-06-26 11:19:46 -04:00
  • 34878a0b48 [Doc] Rename page titles (#20130) Cyrus Leung 2025-06-26 23:18:49 +08:00
  • 6393b03986 [Doc] Auto sign-off for VSCode (#20132) Cyrus Leung 2025-06-26 23:18:36 +08:00
  • 0907d507bf [Doc] Automatically signed-off by PyCharm (#20120) wang.yuqi 2025-06-26 22:34:17 +08:00
  • c894c5dc1f [Bug Fix] Fix address/port already in use error for deep_ep test (#20094) Wentao Ye 2025-06-26 10:33:13 -04:00
  • 1f5d178e9c Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" (#20128) Michael Goin 2025-06-26 23:32:22 +09:00
  • 27c065df50 [Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) (#19904) TJian 2025-06-26 05:42:31 -07:00
  • 84c260caeb [Docs] Improve frameworks/helm.md (#20113) Michael Yao 2025-06-26 18:41:51 +08:00
  • 167aca45cb [Misc] Use collapsible blocks for benchmark examples. (#20017) Reid 2025-06-26 18:35:16 +08:00
  • 0567c8249f [CPU] Fix torch version in x86 CPU backend (#19258) Li, Jiang 2025-06-26 18:34:47 +08:00
  • d188913d99 [Refactor] Remove unused library (#20099) Wentao Ye 2025-06-26 05:16:10 -04:00
  • 1d7c29f5fe [Doc] Update docs for New Model Implementation (#20115) Cyrus Leung 2025-06-26 15:47:06 +08:00
  • 65397e40f5 [Bugfix] Allow CUDA_VISIBLE_DEVICES='' in Platform.device_id_to_physical_device_id (#18979) Seiji Eicher 2025-06-26 00:01:57 -07:00
  • 9502c38138 [Benchmark][Bug] Fix multiple bugs in bench and add args to spec_decode offline (#20083) Ekagra Ranjan 2025-06-26 01:06:27 -04:00
  • 2582683566 [PD] Skip tp_size exchange with rank0 (#19413) Nicolò Lucchesi 2025-06-26 05:04:39 +02:00
  • 754b00edb3 [Bugfix] Fix Mistral tool-parser regex for nested JSON (#20093) Michael Goin 2025-06-26 10:01:17 +09:00
  • 296ce95d8e [CI] Add SM120 to the Dockerfile (#19794) Michael Goin 2025-06-26 08:23:56 +09:00
  • 2d7620c3eb [TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN (#19919) Chenyaaang 2025-06-25 15:51:02 -07:00
  • 55c65ab495 [P/D] Avoid stranding blocks in P when aborted in D's waiting queue (#19223) Nick Hill 2025-06-25 15:19:44 -07:00
  • 2cc2069970 [TPU][Bugfix] fix kv cache padding (#20048) Chengji Yao 2025-06-25 14:24:10 -07:00
  • 9f0608fc16 [Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine (#20062) zhrrr 2025-06-26 05:03:17 +08:00
  • 4e0db57fff Fix the path to the testing script. (#20082) QiliangCui 2025-06-25 13:48:17 -07:00
  • c40692bf9a [Misc] Add parallel state node_count function (#20045) Nick Hill 2025-06-25 13:38:53 -07:00
  • 4734704b30 [PD] let toy proxy handle /chat/completions (#19730) lkchen 2025-06-25 12:17:45 -07:00
  • 8b8c209e35 static_scaled_fp8_quant should not run when scale.numel is not 1 (#20076) Eldar Kurtić 2025-06-25 21:08:03 +02:00
  • 23a04e0895 [Fix] Support cls pooling in ModernBertPooler (#20067) lsz05 2025-06-26 04:07:45 +09:00
  • 02c97d9a92 [Quantization] Add compressed-tensors emulations support for NVFP4 (#19879) Dipika Sikka 2025-06-25 14:28:19 -04:00
  • e795d723ed [Frontend] Add /v1/audio/translations OpenAI API endpoint (#19615) Nicolò Lucchesi 2025-06-25 19:54:14 +02:00
  • 8359f4c8d8 [V1][Speculative Decoding] Fix DeepSeek MTP (#20022) cjackal 2025-06-26 00:41:02 +09:00
  • bf5181583f [Doc] Guide for Incremental Compilation Workflow (#19109) Michael Goin 2025-06-25 22:06:46 +09:00
  • c53fec1fcb [doc] add reference link for Intel XPU (#20064) Reid 2025-06-25 20:24:07 +08:00
  • 0f9e7354f5 [BugFix] Fix full-cuda-graph illegal memory access in FA3 (#20057) Lucas Wilkinson 2025-06-25 04:39:04 -04:00
  • ba7ba35cda [Chore] debloat some initial logs (#19438) Aaron Pham 2025-06-25 02:36:22 -04:00
  • 015fab8c2f [Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. (#19717) bnellnm 2025-06-25 02:22:58 -04:00
  • f59fc60fb3 [Feat][CLI] enforce-include-usage (#19695) Max Wittig 2025-06-25 07:43:04 +02:00