Commit Graph

  • 6d917d0eeb Enable mypy checking on V1 code (#11105) Mark McLoughlin 2024-12-14 17:54:04 +00:00
  • 93abf23a64 [VLM] Fully dynamic prompt replacement in merged input processor (#11199) Cyrus Leung 2024-12-15 01:52:18 +08:00
  • 9c3dadd1c9 [Frontend] Add logits_processors as an extra completion argument (#11150) Brad Hilton 2024-12-14 09:46:42 -07:00
  • 3cb5769883 [Misc] Minor improvements to the readability of PunicaWrapperBase (#11200) Jee Jee Li 2024-12-15 00:38:27 +08:00
  • ea7bd68d10 [V1][Bugfix] Fix V1 TP trust-remote-code (#11182) Tyler Michael Smith 2024-12-14 03:21:23 -05:00
  • 48259264a4 [Core] Update outlines and increase its threadpool size (#11140) Russell Bryant 2024-12-14 02:46:18 -05:00
  • 24a3d12b82 update compressed-tensors to latest version (#11183) dhuangnm 2024-12-13 22:22:44 -05:00
  • 9855aea21b [Bugfix][V1] Re-compute an entire block when fully cache hit (#11186) Cody Yu 2024-12-13 17:08:23 -08:00
  • 4b5b8a6a3b [V1][Bugfix] Fix EngineCoreProc profile (#11185) Tyler Michael Smith 2024-12-13 20:02:35 -05:00
  • 4863e5fba5 [Core] V1: Use multiprocessing by default (#11074) Russell Bryant 2024-12-13 19:27:32 -05:00
  • 0d8451c3a4 [Distributed] Allow the placement group more time to wait for resources to be ready (#11138) Jiaxin Shan 2024-12-13 12:17:37 -08:00
  • 0a56bcc03d [Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend (#11169) Jani Monoses 2024-12-13 20:00:40 +02:00
  • 0920ab9131 [Doc] Reorganize online pooling APIs (#11172) Cyrus Leung 2024-12-14 00:22:22 +08:00
  • 238c0d93b4 [Misc] Add tokenizer_mode param to benchmark_serving.py (#11174) Alexander Matveev 2024-12-13 11:19:10 -05:00
  • 5b0ed8391d [Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor (#11156) zhangjf 2024-12-13 23:56:19 +08:00
  • c31d4a57a6 [Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching (#8240) Sungjae Lee 2024-12-14 00:51:25 +09:00
  • d1fa714cb1 [Refactor]A simple device-related refactor (#11163) Chenguang Li 2024-12-13 21:39:00 +08:00
  • 969da7d70b [V1][VLM] Fix edge case bug for InternVL2 (#11165) Roger Wang 2024-12-13 03:09:30 -08:00
  • eeec9e3390 [Frontend] Separate pooling APIs in offline inference (#11129) Cyrus Leung 2024-12-13 18:40:07 +08:00
  • f93bf2b189 [Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt (#11159) Li, Jiang 2024-12-13 16:50:35 +08:00
  • 7cd7409142 PaliGemma 2 support (#11142) Jani Monoses 2024-12-13 09:40:07 +02:00
  • be39e3cd18 [core] clean up cudagraph batchsize padding logic (#10996) youkaichao 2024-12-12 22:57:50 -08:00
  • 34f1a806d5 [Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' (#11157) Cody Yu 2024-12-12 22:30:06 -08:00
  • 00c1bde5d8 [ROCm][AMD] Disable auto enabling chunked prefill on ROCm (#11146) Gregory Shtrasberg 2024-12-13 00:31:26 -05:00
  • 3989a79824 [Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization (#11148) Dipika Sikka 2024-12-13 00:07:20 -05:00
  • 1efce68605 [Bugfix] Use runner_type instead of task in GritLM (#11144) Pooya Davoodi 2024-12-12 20:09:53 -08:00
  • 30870b4f66 [torch.compile] Dynamic fp8 + rms_norm fusion (#10906) Luka Govedič 2024-12-12 22:19:23 -05:00
  • 78ed8f57d8 [Misc][V1] Fix type in v1 prefix caching (#11151) Cody Yu 2024-12-12 16:57:40 -08:00
  • db6c264a1e [Bugfix] Fix value unpack error of simple connector for KVCache transfer. (#11058) shangmingc 2024-12-13 05:19:17 +08:00
  • 9f3974a319 Fix logging of the vLLM Config (#11143) Jeremy Arnold 2024-12-12 14:05:57 -06:00
  • 2c97eca1ff [Misc] Validate grammar and fail early (#11119) Cody Yu 2024-12-12 10:34:26 -08:00
  • 5d712571af [Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c7ae. (#11024) Jeff Cook 2024-12-12 11:09:20 -07:00
  • d4d5291cc2 fix(docs): typo in helm install instructions (#11141) Ramon Ziai 2024-12-12 18:36:32 +01:00
  • 4816d20aa4 [V1] Fix torch profiling for offline inference (#11125) Roger Wang 2024-12-12 07:51:53 -08:00
  • 85362f028c [Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094) Jiaxin Shan 2024-12-12 01:25:16 -08:00
  • 62de37a38e [core][distributed] initialization from StatelessProcessGroup (#10986) youkaichao 2024-12-12 01:04:19 -08:00
  • 8195824206 [Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) (#10565) Sanju C Sudhakaran 2024-12-12 13:39:28 +05:30
  • f092153fbe [V1] Use more persistent buffers to optimize input preparation overheads (#11111) Woosuk Kwon 2024-12-11 23:14:20 -08:00
  • 1da8f0e1dd [Model] Add support for embedding model GritLM (#10816) Pooya Davoodi 2024-12-11 22:39:16 -08:00
  • ccede2b264 [Core] cleanup zmq ipc sockets on exit (#11115) Russell Bryant 2024-12-11 22:12:24 -05:00
  • 24a36d6d5f Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst (#11112) Yuan Tang 2024-12-11 21:39:21 -05:00
  • 8fb26dac61 [Docs] Add media kit (#11121) Simon Mo 2024-12-11 17:33:11 -08:00
  • 7439a8b5fc [Bugfix] Multiple fixes to tool streaming with hermes and mistral (#10979) Clayton 2024-12-11 17:10:12 -08:00
  • 4e11683368 [V1] VLM preprocessor hashing (#11020) Alexander Matveev 2024-12-11 19:55:30 -05:00
  • 452a723bf2 [V1][Core] Remove should_shutdown to simplify core process termination (#11113) Tyler Michael Smith 2024-12-11 18:34:54 -05:00
  • d1e21a979b [CI/Build] Split up VLM tests (#11083) Cyrus Leung 2024-12-12 06:18:16 +08:00
  • 72ff3a9686 [core] Bump ray to use _overlap_gpu_communication in compiled graph tests (#10410) Rui Qiao 2024-12-11 11:36:35 -08:00
  • 66aaa7722d [torch.compile] remove graph logging in ci (#11110) youkaichao 2024-12-11 10:59:50 -08:00
  • d643c2aba1 [V1] Use input_ids as input for text-only models (#11032) Woosuk Kwon 2024-12-11 10:49:23 -08:00
  • 91642db952 [torch.compile] use depyf to dump torch.compile internals (#10972) youkaichao 2024-12-11 10:43:05 -08:00
  • fd22220687 [Doc] Installed version of llmcompressor for int8/fp8 quantization (#11103) bingps 2024-12-11 23:43:24 +08:00
  • b2f775456e [CI/Build] Enable prefix caching test for AMD (#11098) hissu-hyvarinen 2024-12-11 17:23:37 +02:00
  • cad5c0a6ed [Doc] Update docs to refer to pooling models (#11093) Cyrus Leung 2024-12-11 21:36:27 +08:00
  • 8f10d5e393 [Misc] Split up pooling tasks (#10820) Cyrus Leung 2024-12-11 17:28:00 +08:00
  • 40766ca1b8 [Bugfix]: Clamp -inf logprob values in prompt_logprobs (#11073) Rafael Vasquez 2024-12-11 04:27:39 -05:00
  • 2e32f5d28d [Bugfix] Fix Idefics3 fails during multi-image inference (#11080) B-201 2024-12-11 17:27:07 +08:00
  • 61b1d2f6ae [Core] v1: Use atexit to handle engine core client shutdown (#11076) Russell Bryant 2024-12-11 04:26:36 -05:00
  • 9974fca047 [ci/build] Fix entrypoints test and pin outlines version (#11088) Kevin H. Luu 2024-12-11 01:01:53 -08:00
  • 3fb4b4f163 [ci/build] Fix AMD CI dependencies (#11087) Kevin H. Luu 2024-12-11 00:39:53 -08:00
  • 2e33fe4191 [CI/Build] Check transformers v4.47 (#10991) Cyrus Leung 2024-12-11 13:02:02 +08:00
  • e39400a4b6 Fix streaming for granite tool call when <|tool_call|> is present (#11069) Maximilien de Bayser 2024-12-11 01:51:40 -03:00
  • ffa48c9146 [Model] PP support for Mamba-like models (#10992) Mor Zusman 2024-12-11 04:53:37 +02:00
  • d5c5154fcf [Misc] LoRA + Chunked Prefill (#9057) Aurick Qiao 2024-12-10 21:09:20 -05:00
  • 9a93973708 [Bugfix] Fix Mamba multistep (#11071) Tyler Michael Smith 2024-12-10 19:16:22 -05:00
  • 134810b3d9 [V1][Bugfix] Always set enable_chunked_prefill = True for V1 (#11061) Woosuk Kwon 2024-12-10 14:41:23 -08:00
  • 75f89dc44c [torch.compile] add a flag to track batchsize statistics (#11059) youkaichao 2024-12-10 12:40:52 -08:00
  • e739194926 [Core] Update to outlines >= 0.1.8 (#10576) Russell Bryant 2024-12-10 15:08:16 -05:00
  • 250ee65d72 [BUG] Remove token param #10921 (#11022) Flávia Béo 2024-12-10 14:38:15 -03:00
  • 9b9cef3145 [Bugfix] Backport request id validation to v0 (#11036) Joe Runde 2024-12-10 09:38:23 -07:00
  • d05f88679b [Misc][LoRA] Add PEFTHelper for LoRA (#11003) Jee Jee Li 2024-12-10 19:12:01 +08:00
  • beb16b2c81 [Bugfix] Handle <|tool_call|> token in granite tool parser (#11039) Travis Johnson 2024-12-10 03:27:11 -07:00
  • fe2e10c71b Add example of helm chart for vllm deployment on k8s (#9199) Maxime Fournioux 2024-12-10 10:19:27 +01:00
  • 82c73fd510 [Bugfix] cuda error running llama 3.2 (#11047) Gene Der Su 2024-12-09 23:41:11 -08:00
  • bfd610430c Update README.md (#11034) Diego Marinho 2024-12-10 18:08:10 +11:00
  • e35879c276 [Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. (#11043) Jeff Cook 2024-12-09 23:54:22 -07:00
  • ebf778061d monitor metrics of tokens per step using cudagraph batchsizes (#11031) youkaichao 2024-12-09 22:35:36 -08:00
  • 28b3a1c7e5 [V1] Multiprocessing Tensor Parallel Support for v1 (#9856) Tyler Michael Smith 2024-12-10 01:28:14 -05:00
  • bc192a2b09 [Pixtral] Improve loading (#11040) Patrick von Platen 2024-12-10 07:09:32 +01:00
  • 980ad394a8 [Frontend] Use request id from header (#10968) Joe Runde 2024-12-09 22:46:29 -07:00
  • 391d7b2763 [Bugfix] Fix usage of deprecated decorator (#11025) Cyrus Leung 2024-12-10 13:45:47 +08:00
  • d1f6d1c8af [Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba (#10739) Isotr0py 2024-12-10 10:23:07 +08:00
  • 6d525288c1 [Docs] Add dedicated tool calling page to docs (#10554) Michael Goin 2024-12-09 20:15:34 -05:00
  • 6faec54505 [V1] Do not store None in self.generators (#11038) Woosuk Kwon 2024-12-09 15:08:19 -08:00
  • 5ed5d5f128 Build tpu image in release pipeline (#10936) Richard Liu 2024-12-09 15:07:48 -08:00
  • b63ba84832 [ROCm][bugfix] scpecilative decoding worker class (#11035) Gregory Shtrasberg 2024-12-09 17:00:29 -05:00
  • 9c6459e4cb [Neuron] Upgrade neuron to 2.20.2 (#11016) xendo 2024-12-09 22:53:24 +01:00
  • 1a2f8fb828 [v1] fix use compile sizes (#11000) youkaichao 2024-12-09 13:47:24 -08:00
  • cbcbdb1ceb [Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028) Konrad Zawora 2024-12-09 22:21:06 +01:00
  • a811dd6608 [Model] merged input processor for Phi-3-Vision models (#10977) Isotr0py 2024-12-10 04:55:10 +08:00
  • ca871491ed [Misc][LoRA] Abstract PunicaWrapper (#10955) Jee Jee Li 2024-12-10 04:54:44 +08:00
  • 3b61cb450d [V1] Further reduce CPU overheads in flash-attn (#10989) Woosuk Kwon 2024-12-09 12:38:46 -08:00
  • edc4fa3188 [ci/build] Recompile CI dependencies list with Python 3.12 (#11013) Kevin H. Luu 2024-12-09 11:46:58 -08:00
  • 25b79d9fd3 [V1] Input Batch Relocation (#10962) Varun Sundar Rabindranath 2024-12-09 12:33:41 -05:00
  • aea2fc38c3 [Platform] Move async output check to platform (#10768) wangxiyuan 2024-12-10 01:24:46 +08:00
  • e691b26f6f [Core] Require xgrammar >= 0.1.6 (#11021) Russell Bryant 2024-12-09 11:44:27 -05:00
  • c690357928 [V1] Fix Detokenizer loading in AsyncLLM (#10997) Roger Wang 2024-12-09 08:27:10 -08:00
  • d1c2e15eb3 [torch.compile] add dynamo time tracking (#11005) youkaichao 2024-12-08 23:09:04 -08:00
  • af7c4a92e6 [Doc][V1] Add V1 support column for multimodal models (#10998) Roger Wang 2024-12-08 22:29:16 -08:00
  • 46004e83a2 [misc] clean up and unify logging (#10999) youkaichao 2024-12-08 17:28:27 -08:00
  • 43b05fa314 [torch.compile][misc] fix comments (#10993) youkaichao 2024-12-08 11:18:18 -08:00