Commit Graph

  • a5cfbab3c8 [Core] LoRA: V1 Scheduler optimization (#15422) Varun Sundar Rabindranath 2025-03-25 15:50:09 -07:00
  • ac3cd6e83c [core] add bucket padding to tpu_model_runner (#14995) Chenyaaang 2025-03-25 14:27:22 -07:00
  • 082ab86f5f [V1] Support long_prefill_token_threshold in v1 scheduler (#15419) Lu Fang 2025-03-25 14:22:26 -07:00
  • 6aa196c8dc [V1][Minor] Use SchedulerInterface type for engine scheduler field (#15499) Nick Hill 2025-03-25 14:21:36 -07:00
  • a0dd7dcd49 [TPU][V1] Fix Sampler recompilation (#15309) Nicolò Lucchesi 2025-03-25 21:43:54 +01:00
  • e977c11111 Add workaround for shared field_names in pydantic model class (#13925) Maximilien de Bayser 2025-03-25 17:31:08 -03:00
  • 5f063a80bd [bugfix] add supports_v1 platform interface (#15417) Joe Runde 2025-03-25 15:00:32 -04:00
  • 5d8e1c9279 [Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility) (#15471) Antonio Gómez 2025-03-25 18:59:25 +01:00
  • 0a049c7d86 [CI/Build] Add tests for the V1 tpu_model_runner. (#14843) yarongmu-google 2025-03-25 09:27:16 -07:00
  • d0cfec7ab9 [bugfix] fix inductor cache on max_position_embeddings (#15436) youkaichao 2025-03-25 22:05:39 +08:00
  • a608160027 [Kernel] Fix conflicting macro names for gguf kernels (#15456) Szymon Ożóg 2025-03-25 14:50:49 +01:00
  • 3f04a7fbf2 [Doc] Update V1 user guide for multi-modality (#15460) Cyrus Leung 2025-03-25 19:01:58 +08:00
  • 5994430b84 [Misc] Remove redundant num_embeds (#15443) Cyrus Leung 2025-03-25 18:27:57 +08:00
  • a9e879b316 [Misc] Clean up MiniCPM-V/O code (#15337) Cyrus Leung 2025-03-25 18:22:52 +08:00
  • 3e2f37a69a Dockerfile.ppc64le changes to move to UBI (#15402) Md. Shafi Hussain 2025-03-25 15:45:14 +05:30
  • 4f044b1d67 [Kernel][CPU] CPU MLA (#14744) Thien Tran 2025-03-25 17:34:59 +08:00
  • 4157f563b4 [Hardware][TPU][Bugfix] Fix v1 mp profiler (#15409) Siyuan Liu 2025-03-25 01:43:00 -07:00
  • 051da7efe3 Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 (#15160) Lu Fang 2025-03-25 00:36:45 -07:00
  • 25f560a62c [V1][Spec Decode] Update target_logits in place for rejection sampling (#15427) (tag: v0.8.2) Woosuk Kwon 2025-03-24 21:04:41 -07:00
  • a09ad90a72 [V1] guidance backend for structured output + auto fallback mode (#14779) Russell Bryant 2025-03-25 00:02:33 -04:00
  • 10b34e36b9 [Bugfix] Fixed the issue of not being able to input video and image simultaneously (#15387) Chauncey 2025-03-25 11:48:08 +08:00
  • b5269db959 Revert "Fix non-contiguous input passed to Marlin kernel (#15319)" (#15398) Tyler Michael Smith 2025-03-24 23:43:51 -04:00
  • 6db94571d7 [Misc] Remove LoRA log (#15388) Jee Jee Li 2025-03-25 11:43:48 +08:00
  • 97cfa65df7 Add pipeline parallel support to TransformersModel (#12832) Harry Mellor 2025-03-25 02:41:45 +00:00
  • 911c8eb000 [Minor][Spec Decode] Remove compiled_softmax (#15416) Woosuk Kwon 2025-03-24 19:09:04 -07:00
  • ebcebeeb6b [V1][Spec Decode] Enable spec decode for top-p & top-k sampling (#15063) Woosuk Kwon 2025-03-24 17:16:46 -07:00
  • f533b5837f [ROCm][Kernel] MoE weights padding (#14454) Gregory Shtrasberg 2025-03-24 19:45:30 -04:00
  • 8279201ce6 [Build] Cython compilation support fix (#14296) Gregory Shtrasberg 2025-03-24 19:37:54 -04:00
  • 23fdab00a8 [Hardware][TPU] Skip failed compilation test (#15421) Siyuan Liu 2025-03-24 16:28:57 -07:00
  • 623e2ed29f [BugFix][V1] Quick fix for min_tokens with multiple EOS (#15407) Nick Hill 2025-03-24 15:58:59 -07:00
  • 9d72daf4ce [V1][Perf] Simpler request output queues (#15156) Nick Hill 2025-03-24 15:44:08 -07:00
  • 6dd55af6c9 [Doc] Update docs on handling OOM (#15357) Cyrus Leung 2025-03-25 05:29:34 +08:00
  • 3eb08ed9b1 [DOC] Add Kubernetes deployment guide with CPUs (#14865) Yuan Tang 2025-03-24 13:48:43 -04:00
  • 5eeadc2642 [Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral (#12303) liuzhenwei 2025-03-25 00:48:40 +08:00
  • 3aee6573dc [V1] Aggregate chunked prompt logprobs in model runner (#14875) Nick Hill 2025-03-24 09:27:57 -07:00
  • 9cc645141d [MISC] Refine no available block debug msg (#15076) Yi Liu 2025-03-25 00:01:10 +08:00
  • 0893567db9 [V1][Minor] fix comments (#15392) Chen1022 2025-03-24 23:45:32 +08:00
  • 8abe69b499 [Core] Don't force uppercase for VLLM_LOGGING_LEVEL (#15306) Russell Bryant 2025-03-24 11:27:30 -04:00
  • 761702fd19 [Core] Integrate fastsafetensors loader for loading model weights (#10647) Manish Sethi 2025-03-24 11:08:02 -04:00
  • 9606d572ed [distributed] fix dp group (#15355) youkaichao 2025-03-24 22:54:27 +08:00
  • cbcdf2c609 [Bugfix] Fix chat template loading (#15143) Cyrus Leung 2025-03-24 21:50:09 +08:00
  • 038de04d7b Fix zmq IPv6 URL format error (#15341) Russell Bryant 2025-03-24 09:30:41 -04:00
  • 6b3cc75be0 [Kernel] allow non-contiguous input for marlin kernel (#14658) Jinzhen Lin 2025-03-24 21:21:33 +08:00
  • 7ffcccfa5c Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377) Simon Mo 2025-03-24 05:53:10 -07:00
  • cc8accfd53 [Misc] Update guided decoding logs to debug (#15310) sfbemerk 2025-03-24 12:25:20 +01:00
  • 948ab03e7e [Bugfix][V1] Avoid importing PreTrainedModel (#15366) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-03-24 12:33:12 +02:00
  • 5797fb97e9 [Misc] Remove ignore_reinit_error for ray.init() (#15373) Rui Qiao 2025-03-24 00:41:53 -07:00
  • 3892e58ad7 [Misc] Upgrade BNB version (#15183) Jee Jee Li 2025-03-24 13:51:42 +08:00
  • d20e261199 Fix non-contiguous input passed to Marlin kernel (#15319) Qubitium-ModelCloud 2025-03-24 11:09:44 +08:00
  • f622dbcf39 [Fix] [torch.compile] Improve UUID system for custom passes (#15249) Luka Govedič 2025-03-23 21:54:07 -04:00
  • dccf535f8e [V1] Enable V1 Fp8 cache for FA3 in the oracle (#15191) Lucas Wilkinson 2025-03-23 18:07:04 -04:00
  • 9c5c81b0da [Misc][Doc] Add note regarding loading generation_config by default (#15281) Roger Wang 2025-03-23 14:00:55 -07:00
  • d6cd59f122 [Frontend] Support tool calling and reasoning parser (#14511) Robin 2025-03-24 05:00:07 +08:00
  • bc8ed3c4ba [V1][Spec Decode] Use better defaults for N-gram (#15358) Woosuk Kwon 2025-03-23 10:52:30 -07:00
  • b9bd76ca14 [V1][Spec Decode] Respect prompt_lookup_max (#15348) Woosuk Kwon 2025-03-23 10:41:44 -07:00
  • 6ebaf9ac71 [Bugfix] consider related env vars for torch.compiled cache hash (#14953) DefTruth 2025-03-23 23:53:09 +08:00
  • f90d34b498 [Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 (#15322) DefTruth 2025-03-23 16:10:10 +08:00
  • f68cce8e64 [ci/build] fix broken tests in LLM.collective_rpc (#15350) youkaichao 2025-03-23 14:49:48 +08:00
  • 09b6a95551 [ci/build] update torch nightly version for GH200 (#15135) youkaichao 2025-03-23 14:04:13 +08:00
  • 50c9636d87 [V1][Usage] Refactor speculative decoding configuration and tests (#14434) shangmingc 2025-03-23 13:28:10 +08:00
  • 0661cfef7a Fix v1 supported oracle for worker-cls and worker-extension-cls (#15324) hijkzzz 2025-03-23 10:23:35 +08:00
  • a827aa815d [doc] Add back previous news (#15331) Chen Zhang 2025-03-23 08:38:33 +08:00
  • b877031d80 Remove openvino support in favor of external plugin (#15339) Russell Bryant 2025-03-22 17:06:39 -04:00
  • dd861b992f [BugFix][Typing] Fix Imprecise Type Annotations (#15208) Wang Ran (汪然) 2025-03-23 00:05:03 +08:00
  • eb63ea1e18 [V1] Add disable-any-whitespace option support for xgrammar (#15316) Russell Bryant 2025-03-22 11:56:17 -04:00
  • 2f4bd358f1 [Model] Support Tele-FLM Model (#15023) Naitong Yu 2025-03-22 17:04:44 +08:00
  • 8a8b30eac1 [Bugfix] LoRA V0 - Fix case where max_num_seqs is between cudagraph capture sizes (#15308) Varun Sundar Rabindranath 2025-03-22 05:03:32 -04:00
  • 2fa0e1396b [Bugfix] Fix torch.compile raise FileNotFoundError (#15278) Jee Jee Li 2025-03-22 13:49:34 +08:00
  • 1c2bec0f82 [Doc] add load_format items in docs (#14804) wwl2755 2025-03-22 00:36:43 -05:00
  • ec870fba9a [FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature (#14959) TJian 2025-03-22 13:36:14 +08:00
  • df1430265c [Bugfix][V0] Multi-sequence logprobs streaming edge case (#15259) Andy Lo 2025-03-22 05:35:37 +00:00
  • 4c69e228b3 [Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout (#15301) Rui Qiao 2025-03-21 22:25:43 -07:00
  • 790b79750b [Build/CI] Fix env var typo (#15305) Russell Bryant 2025-03-21 18:28:46 -04:00
  • cfbb8c930f [TPU][V1] MHA Pallas backend (#15288) Nicolò Lucchesi 2025-03-21 16:50:39 +01:00
  • baec0d4de9 Revert "[Feature] specify model in config.yaml (#14855)" (#15293) Cyrus Leung 2025-03-21 23:30:23 +08:00
  • c21b99b912 [Bugfix][VLM] fix llava processor (#15285) Mengqing Cao 2025-03-21 20:14:36 +08:00
  • 93a00d7dde [v1] Refactor KVCacheConfig (#14079) Chen Zhang 2025-03-21 19:56:27 +08:00
  • 61e8c18350 [Misc] Add cProfile helpers (#15074) Russell Bryant 2025-03-21 07:56:09 -04:00
  • 8afcd0f633 [Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend (#15282) Isotr0py 2025-03-21 19:42:06 +08:00
  • 91ca929dc7 [V1] Fix wrong import path of get_flash_attn_version (#15280) Lehua Ding 2025-03-21 18:54:11 +08:00
  • 84e00adc8a [Bugfix] Fix incorrect resolving order for transformers fallback (#15279) Isotr0py 2025-03-21 18:54:08 +08:00
  • 47c7126213 [Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL (#15273) Isotr0py 2025-03-21 18:32:33 +08:00
  • a989ca2bf6 [Bugfix] Add int8 torch dtype for KVCache (#15260) Shanshan Shen 2025-03-21 16:58:28 +08:00
  • 0fa3970deb [Feature] specify model in config.yaml (#14855) Wei Zeng 2025-03-21 00:26:03 -07:00
  • da6ea29f7a [V1] Avoid redundant input processing in n>1 case (#14985) Nick Hill 2025-03-20 22:24:10 -07:00
  • 7297941b38 [Doc] Update LWS docs (#15163) Edwin Hernandez 2025-03-20 21:18:47 -07:00
  • f8a08cb90d [V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071) Isotr0py 2025-03-21 11:14:19 +08:00
  • b15fd2be2a [Hardware][TPU] Add check for no additional graph compilation during runtime (#14710) Siyuan Liu 2025-03-20 20:05:28 -07:00
  • e588ac237c Add an example for reproducibility (#15262) Woosuk Kwon 2025-03-20 19:55:47 -07:00
  • 5df2da5b97 [Misc] Better RayExecutor and multiprocessing compatibility (#14705) Cody Yu 2025-03-20 19:27:46 -07:00
  • 11b986b3fb [Docs] Trim the latest news in README (#15261) Woosuk Kwon 2025-03-20 19:24:21 -07:00
  • 296f927f24 [Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies (#14857) Chih-Chieh Yang 2025-03-20 19:21:08 -07:00
  • 0032903a5b [Bugfix] detect alibi and revert to FA2 (#15231) Travis Johnson 2025-03-20 20:20:16 -06:00
  • 47195057e9 [V1][TPU] Speed up top-k on TPU by using torch.topk (#15242) Hyesoo Yang 2025-03-20 19:19:40 -07:00
  • 6edbfa924d Mention extra_body as a way to pass vLLM only parameters using the OpenAI client (#15240) Harry Mellor 2025-03-21 02:18:36 +00:00
  • 1e508343e1 [Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation (#15200) Isotr0py 2025-03-21 10:18:04 +08:00
  • 2e0b4cfde0 [ROCM] Upgrade torch to 2.6 (#15244) Sage Moore 2025-03-20 19:17:33 -07:00
  • 10f55fe6c5 [Misc] Clean up the BitsAndBytes arguments (#15140) Jee Jee Li 2025-03-21 10:17:12 +08:00
  • d3ccbd6350 Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 (#15159) Lu Fang 2025-03-20 19:01:11 -07:00
  • 0cfe7d386d [CI/Build] LoRA : make add_lora_test safer (#15181) Varun Sundar Rabindranath 2025-03-20 21:28:53 -04:00