Commit Graph

  • 9013e24f7b [torch.compile] Adding torch compile annotations to some models (#9614) Yongzao 2024-10-24 01:07:48 +08:00
  • fd0e2cfdb2 [Misc] Separate total and output tokens in benchmark_throughput.py (#8914) Michael Goin 2024-10-23 12:47:20 -04:00
  • e5ac6a4199 [Bugfix] Fix divide by zero when serving Mamba models (#9617) Tyler Michael Smith 2024-10-23 12:40:43 -04:00
  • dbdd3b5e5a [misc] comment to avoid future confusion about baichuan (#9620) youkaichao 2024-10-23 09:14:44 -07:00
  • e7116c017c [Bugfix] Fix _init_vision_model in NVLM_D model (#9611) Cyrus Leung 2024-10-23 22:09:04 +08:00
  • 31a08f5bd2 [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs (#9612) Alex Brooks 2024-10-23 08:05:18 -06:00
  • c18e1a3418 [VLM] Enable overriding whether post layernorm is used in vision encoder + fix quant args (#9217) Cyrus Leung 2024-10-23 19:27:37 +08:00
  • 3ff57ebfca [Model] Initialize Florence-2 language backbone support (#9555) Isotr0py 2024-10-23 18:42:47 +08:00
  • 2394962d70 [Hardware][XPU] using current_platform.is_xpu (#9605) Mengqing Cao 2024-10-23 16:28:21 +08:00
  • 51c24c9736 [Build] Fix FetchContent multiple build issue (#9596) Luka Govedič 2024-10-23 00:43:07 -04:00
  • 831540cf04 [Model] Support E5-V (#9576) Cyrus Leung 2024-10-23 11:35:29 +08:00
  • 29061ed9df [Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages (#9590) Flex Wang 2024-10-22 20:17:28 -07:00
  • 65050a40e6 [Bugfix] Generate exactly input_len tokens in benchmark_throughput (#9592) Chen Zhang 2024-10-22 17:45:35 -07:00
  • 208cb34c81 [Doc]: Update tensorizer docs to include vllm[tensorizer] (#7889) Seth Kimmel 2024-10-22 15:43:25 -07:00
  • b17046e298 [BugFix] Fix metrics error for --num-scheduler-steps > 1 (#8234) yulei 2024-10-23 06:43:03 +08:00
  • d1e8240875 [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing (#9487) Lucas Wilkinson 2024-10-22 18:41:13 -04:00
  • cb6fdaa0a0 [Misc] Make benchmarks use EngineArgs (#9529) Jeremy Arnold 2024-10-22 17:40:38 -05:00
  • 23b899a8e6 [Bugfix] fix detokenizer shallow copy (#5919) Aurick Qiao 2024-10-22 18:38:12 -04:00
  • 17c79f3c36 [torch.compile] auto infer dynamic_arg_dims from type annotation (#9589) youkaichao 2024-10-22 13:43:37 -07:00
  • cd5601ac37 [BugFix] Prevent exporting duplicate OpenTelemetry spans (#9017) Ronen Schaffer 2024-10-22 21:11:53 +03:00
  • 434984e665 [Frontend] Support custom request_id from request (#9550) Yuhong Guo 2024-10-23 02:07:30 +08:00
  • 32a1ee74a0 [Hardware][Intel CPU][DOC] Update docs for CPU backend (#6212) Yuan 2024-10-22 10:38:04 -07:00
  • 08075c3448 [Bugfix] Eagle: change config name for fc bias (#9580) gopalsarda 2024-10-22 21:44:22 +05:30
  • bb392ea2d2 [Model][VLM] Initialize support for Mono-InternVL model (#9528) Isotr0py 2024-10-23 00:01:46 +08:00
  • 9dbcce84a7 [Neuron] [Bugfix] Fix neuron startup (#9374) xendo 2024-10-22 14:51:41 +02:00
  • a48e3ec052 [CI/Build][LoRA] Temporarily fix long context failure issue (#9579) Jee Jee Li 2024-10-22 19:32:51 +08:00
  • 6c5af09b39 [V1] Implement vLLM V1 [1/N] (#9289) Woosuk Kwon 2024-10-22 01:24:07 -07:00
  • 3ddbe25502 [Hardware][CPU] using current_platform.is_cpu (#9536) wangshuai09 2024-10-22 15:50:43 +08:00
  • 0d02747f2e support TP in qwen2 bnb (#9574) chenqianfzh 2024-10-22 00:13:23 -07:00
  • f7db5f0fa9 [Doc] Use shell code-blocks and fix section headers (#9508) Rafael Vasquez 2024-10-22 02:43:24 -04:00
  • ca30c3c84b [Core] Remove evictor_v1 (#9572) Kuntai Du 2024-10-21 23:55:49 -05:00
  • c0292211ce [CI/Build] Replaced some models on tests for smaller ones (#9570) Wallas Henrique 2024-10-22 01:52:14 -03:00
  • 74692421f7 [Bugfix]: phi.py get rope_theta from config file (#9503) Falko1 2024-10-22 04:53:36 +02:00
  • 29acd2c34c [Bugfix][OpenVINO] fix_dockerfile_openvino (#9552) ngrozae 2024-10-22 04:47:52 +02:00
  • f085995a7b [CI/Build] Remove unnecessary fork_new_process (#9484) Cyrus Leung 2024-10-22 10:47:29 +08:00
  • b729901139 [Bugfix]: serialize config by value for --trust-remote-code (#6751) Travis Johnson 2024-10-21 20:46:24 -06:00
  • 76a5e13270 [core] move parallel sampling out from vllm core (#9302) youkaichao 2024-10-21 17:31:44 -07:00
  • ef7faad1b8 🐛 Fixup more test failures from memory profiling (#9563) Joe Runde 2024-10-21 19:10:56 -05:00
  • 575dcebe9a [CI] Make format checker error message more user-friendly by using emoji (#9564) Kuntai Du 2024-10-21 18:45:15 -05:00
  • 711f3a7806 [Frontend] Don't log duplicate error stacktrace for every request in the batch (#9023) Wallas Henrique 2024-10-21 18:49:41 -03:00
  • 15713e3b75 [BugFix] Update draft model TP size check to allow matching target TP size (#9394) Nick Hill 2024-10-21 22:14:29 +01:00
  • d621c43df7 [doc] fix format (#9562) youkaichao 2024-10-21 13:54:57 -07:00
  • 9d9186be97 [Frontend] Reduce frequency of client cancellation checking (#7959) Nick Hill 2024-10-21 21:28:10 +01:00
  • 5241aa1494 [Model][Bugfix] Fix batching with multi-image in PixtralHF (#9518) Michael Goin 2024-10-21 14:20:07 -04:00
  • ec6bd6c4c6 [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint (#9492) Varad Ahirwadkar 2024-10-21 23:13:02 +05:30
  • 8ca8954841 [Bugfix][Misc]: fix graph capture for decoder (#9549) yudian0504 2024-10-22 01:33:30 +08:00
  • f6b97293aa [Model] FalconMamba Support (#9325) Dhia Eddine Rhaiem 2024-10-21 20:50:16 +04:00
  • 496e991da8 [Doc] Consistent naming of attention backends (#9498) Thomas Parnell 2024-10-21 16:29:57 +02:00
  • 696b01af8f [CI/Build] Split up decoder-only LM tests (#9488) Cyrus Leung 2024-10-21 12:27:50 +08:00
  • 855e0e6f97 [Frontend][Misc] Goodput metric support (#9338) Andy Dai 2024-10-20 11:39:32 -07:00
  • 4fa3e33349 [Kernel] Support sliding window in flash attention backend (#9403) Chen Zhang 2024-10-20 10:57:52 -07:00
  • 962d2c6349 [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision (#9520) Michael Goin 2024-10-20 01:29:14 -04:00
  • 5b59fe0f08 [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger (#9530) Chen Zhang 2024-10-19 17:05:02 -07:00
  • 8e3e7f2713 [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf (#9514) Michael Goin 2024-10-19 10:44:29 -04:00
  • 263d8ee150 [Bugfix] Fix missing task for speculative decoding (#9524) Cyrus Leung 2024-10-19 14:49:40 +08:00
  • c5eea3c8ba [Frontend] Support simpler image input format (#9478) Yue Zhang 2024-10-18 23:17:07 -07:00
  • 85dc92fc98 [CI/Build] Configure matcher for actionlint workflow (#9511) Russell Bryant 2024-10-19 02:04:18 -04:00
  • dfd951ed9b [CI/Build] Add error matching for ruff output (#9513) Russell Bryant 2024-10-19 01:42:20 -04:00
  • 82c25151ec [Doc] update gpu-memory-utilization flag docs (#9507) Joe Runde 2024-10-18 22:26:36 -05:00
  • 1325872ec8 [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily (#9521) Nick Hill 2024-10-19 04:21:01 +01:00
  • 380e18639f 🐛 fix torch memory profiling (#9516) Joe Runde 2024-10-18 20:25:19 -05:00
  • 337ed76671 [Bugfix] Fix offline mode when using mistral_common (#9457) sasha0552 2024-10-19 01:12:32 +00:00
  • 0c9a5258f9 [Kernel] Add env variable to force flashinfer backend to enable tensor cores (#9497) Thomas Parnell 2024-10-19 02:55:48 +02:00
  • d11bf435a0 [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) Cody Yu 2024-10-18 14:30:55 -07:00
  • 9bb10a7d27 [MISC] Add lora requests to metrics (#9477) Kunjan 2024-10-18 13:50:18 -07:00
  • 3921a2f29e [Model] Support Pixtral models in the HF Transformers format (#9036) Michael Goin 2024-10-18 15:29:56 -04:00
  • 67a7e5ef38 [CI/Build] Add error matching config for mypy (#9512) Russell Bryant 2024-10-18 15:17:53 -04:00
  • 051eaf6db3 [Model] Add user-configurable task for models that support both generation and embedding (#9424) Cyrus Leung 2024-10-19 02:31:58 +08:00
  • 7dbe738d65 [Misc] benchmark: Add option to set max concurrency (#9390) Russell Bryant 2024-10-18 14:15:28 -04:00
  • ae8b633ba3 [Bugfix] Fix offline_inference_with_prefix.py (#9505) Tyler Michael Smith 2024-10-18 12:59:19 -04:00
  • 1bbbcc0b1d [CI/Build] Fix lint errors in mistral tokenizer (#9504) Cyrus Leung 2024-10-19 00:09:35 +08:00
  • 25aeb7d4c9 [BugFix] Fix and simplify completion API usage streaming (#9475) Nick Hill 2024-10-18 15:10:26 +01:00
  • d2b1bf55ec [Frontend][Feature] Add jamba tool parser (#9154) tomeras91 2024-10-18 13:27:48 +03:00
  • 1ffc8a7362 [BugFix] Typing fixes to RequestOutput.prompt and beam search (#9473) Nick Hill 2024-10-18 08:19:53 +01:00
  • 944dd8edaf [CI/Build] Use commit hash references for github actions (#9430) Russell Bryant 2024-10-18 00:54:58 -04:00
  • 154a8ae880 [Qwen2.5] Support bnb quant for Qwen2.5 (#9467) Haoyu Wang 2024-10-18 12:40:14 +08:00
  • de4008e2ab [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage (#9352) Joe Runde 2024-10-17 21:47:27 -05:00
  • 48138a8415 [BugFix] Stop silent failures on compressed-tensors parsing (#9381) Dipika Sikka 2024-10-17 21:54:00 -04:00
  • 343f8e0905 Support BERTModel (first encoder-only embedding model) (#9056) Robert Shaw 2024-10-17 19:21:01 -04:00
  • bb76538bbd [Hardwware][Neuron] Simplify model load for transformers-neuronx library (#9380) Shashwat Srijan 2024-10-17 15:39:39 -07:00
  • d615b5c9f8 [Bugfix] Print warnings related to mistral_common tokenizer only once (#9468) sasha0552 2024-10-17 21:44:20 +00:00
  • d65049daab [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script (#9013) Kai Wu 2024-10-17 14:11:11 -07:00
  • eca2c5f7c0 [Bugfix] Fix support for dimension like integers and ScalarType (#9299) bnellnm 2024-10-17 15:08:34 -04:00
  • 0f41fbe5a3 [torch.compile] Fine-grained CustomOp enabling mechanism (#9300) Luka Govedič 2024-10-17 14:36:37 -04:00
  • 7871659abb [Misc] Remove commit id file (#9470) Cyrus Leung 2024-10-18 01:34:37 +08:00
  • a2c71c5405 [CI/Build] remove .github from .dockerignore, add dirty repo check (#9375) (tag: v0.6.3.post1) Daniele 2024-10-17 19:25:06 +02:00
  • 81ede99ca4 [Core] Deprecating block manager v1 and make block manager v2 default (#8704) Kuntai Du 2024-10-17 11:38:15 -05:00
  • 5eda21e773 [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support (#9344) Li, Jiang 2024-10-18 00:21:04 +08:00
  • 8e1cddcd44 [TPU] Call torch._sync(param) during weight loading (#9437) Woosuk Kwon 2024-10-17 09:00:11 -07:00
  • 5e443b594f [Bugfix] Allow prefill of assistant response when using mistral_common (#9446) sasha0552 2024-10-17 15:06:37 +00:00
  • 9d30a056e7 [misc] CUDA Time Layerwise Profiler (#8337) Lucas Wilkinson 2024-10-17 10:36:09 -04:00
  • 390be74649 [Misc] Print stack trace using logger.exception (#9461) Cyrus Leung 2024-10-17 21:55:48 +08:00
  • e312e52b44 [Kernel] Add Exllama as a backend for compressed-tensors (#9395) Lucas Wilkinson 2024-10-17 09:48:26 -04:00
  • dbfa8d31d5 Add notes on the use of Slack (#9442) Yuan Tang 2024-10-17 00:46:46 -04:00
  • 92d86da217 [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels (#9391) rasmith 2024-10-16 20:34:06 -05:00
  • c3fab5f769 [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel (#9425) Tyler Michael Smith 2024-10-16 19:46:06 -04:00
  • 776dbd74f1 [CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267) Russell Bryant 2024-10-16 18:55:59 -04:00
  • 8345045833 [Performance][Spec Decode] Optimize ngram lookup performance (#9333) Lily Liu 2024-10-16 12:37:45 -07:00
  • 5b8a1fde84 [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft (#9396) Junhao Li 2024-10-16 12:40:24 -04:00
  • fb60ae9b91 [Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189) Mor Zusman 2024-10-17 00:12:43 +08:00