Commit Graph

  • 396d92d5e0 [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) Alexander Matveev 2024-07-21 19:41:42 -04:00
  • 25e778aa16 [Model] Refactor and decouple phi3v image embedding (#6621) Isotr0py 2024-07-22 07:07:58 +08:00
  • b6df37f943 [Misc] Remove abused noqa (#6619) Woosuk Kwon 2024-07-21 08:47:04 -07:00
  • 14f91fe67c [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) sroy745 2024-07-20 23:58:58 -07:00
  • d7f4178dd9 [Frontend] Move chat utils (#6602) Cyrus Leung 2024-07-21 08:38:17 +08:00
  • 082ecd80d5 [ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) Robert Shaw 2024-07-20 19:25:56 -04:00
  • f952bbc8ff [Misc] Fix input_scale typing in w8a8_utils.py (#6579) Michael Goin 2024-07-20 19:11:13 -04:00
  • 9364f74eee [ Kernel ] Enable fp8-marlin for fbgemm-fp8 models (#6606) Robert Shaw 2024-07-20 14:50:10 -04:00
  • 06d6c5fe9f [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) Matt Wong 2024-07-20 11:39:07 -05:00
  • 683e3cb9c4 [ Misc ] fbgemm checkpoints (#6559) Robert Shaw 2024-07-20 12:36:57 -04:00
  • 9042d68362 [Misc] Consolidate and optimize logic for building padded tensors (#6541) Cyrus Leung 2024-07-20 12:17:24 +08:00
  • 3f8d42c81f Pipeline Parallel: Guard for KeyErrors at request abort (#6587) Travis Johnson 2024-07-19 20:18:19 -06:00
  • 7bd82002ae [Core] Allow specifying custom Executor (#6557) Antoni Baum 2024-07-19 18:25:06 -07:00
  • 2e26564259 [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) Varun Sundar Rabindranath 2024-07-19 21:15:26 -04:00
  • e81522e879 [build] add ib in image for out-of-the-box infiniband support (#6599) youkaichao 2024-07-19 17:16:57 -07:00
  • 45ceb85a0c [Docs] Update PP docs (#6598) Murali Andoorveedu 2024-07-19 19:38:21 -04:00
  • 4cc24f01b1 [ Kernel ] Enable Dynamic Per Token fp8 (#6547) Robert Shaw 2024-07-19 19:08:15 -04:00
  • 07eb6f19f3 [bugfix][distributed] fix multi-node bug for shared memory (#6597) youkaichao 2024-07-19 15:34:34 -07:00
  • f0bbfaf917 [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578) Thomas Parnell 2024-07-19 23:01:03 +02:00
  • 30efe41532 [Docs] Update docs for wheel location (#6580) Simon Mo 2024-07-19 12:14:11 -07:00
  • 9ed82e7074 [Misc] Small perf improvements (#6520) Antoni Baum 2024-07-19 12:10:56 -07:00
  • 51f8aa90ad [Bugfix][Frontend] remove duplicate init logger (#6581) Daniele 2024-07-19 19:16:27 +02:00
  • a5314e8698 [Model] RowParallelLinear: pass bias to quant_method.apply (#6327) Thomas Parnell 2024-07-19 15:15:22 +02:00
  • a921e86392 [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) Woo-Yeon Lee 2024-07-19 22:01:09 +09:00
  • 6366efc67b [Bugfix][Frontend] Fix missing /metrics endpoint (#6463) Cyrus Leung 2024-07-19 11:55:13 +08:00
  • dbe5588554 [ Misc ] non-uniform quantization via compressed-tensors for Llama (#6515) Robert Shaw 2024-07-18 22:39:18 -04:00
  • d4201e06d5 [Bugfix] Make spec. decode respect per-request seed. (#6034) Thomas Parnell 2024-07-19 04:22:08 +02:00
  • b5672a112c [Core] Multiprocessing Pipeline Parallel support (#6130) Nick Hill 2024-07-18 19:15:52 -07:00
  • c5df56f88b Add support for a rope extension method (#6553) Simon Mo 2024-07-18 18:53:03 -07:00
  • 1689219ebf [CI/Build] Build on Ubuntu 20.04 instead of 22.04 (#6517) Tyler Michael Smith 2024-07-18 20:29:25 -04:00
  • 4ffffccb7e [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552) Tyler Michael Smith 2024-07-18 19:52:22 -04:00
  • f53b8f0d05 [ci][test] add correctness test for cpu offloading (#6549) youkaichao 2024-07-18 16:41:06 -07:00
  • 2d4733ba2d Fix PR comment bot (#6554) Kevin H. Luu 2024-07-18 14:48:29 -07:00
  • 15c6a079b1 [Model] Support Mistral-Nemo (#6548) Michael Goin 2024-07-18 16:31:50 -04:00
  • ecdb462c24 [ci] Reword Github bot comment (#6534) Kevin H. Luu 2024-07-18 08:01:45 -07:00
  • 58ca663224 [ Misc ] Improve Min Capability Checking in compressed-tensors (#6522) Robert Shaw 2024-07-18 10:39:12 -04:00
  • 4634c8728b [TPU] Refactor TPU worker & model runner (#6506) Woosuk Kwon 2024-07-18 01:34:16 -07:00
  • c8a7d51c49 [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) Noam Gat 2024-07-18 10:47:13 +03:00
  • e2fbaee725 [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227) Nick Hill 2024-07-18 00:13:30 -07:00
  • 8a74c68bd1 [Misc] Minor patch for draft model runner (#6523) Cody Yu 2024-07-17 23:06:21 -07:00
  • 61e592747c [Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032) Rui Qiao 2024-07-17 22:27:09 -07:00
  • d25877dd9b [BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530) Nick Hill 2024-07-17 22:24:43 -07:00
  • 1c27d25fb5 [core][model] yet another cpu offload implementation (#6496) youkaichao 2024-07-17 20:54:35 -07:00
  • 18fecc3559 [ Kernel ] Fp8 Channelwise Weight Support (#6487) Robert Shaw 2024-07-17 23:18:13 -04:00
  • b5af8c223c [Model] Pipeline parallel support for Mixtral (#6516) Cody Yu 2024-07-17 19:26:04 -07:00
  • b5241e41d9 [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) Varun Sundar Rabindranath 2024-07-17 21:38:35 -04:00
  • e76466dde2 [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) Alexander Matveev 2024-07-17 17:30:28 -04:00
  • 5f0b9933e6 [Bugfix] Fix Ray Metrics API usage (#6354) Antoni Baum 2024-07-17 12:40:10 -07:00
  • a38524f338 [DOC] - Add docker image to Cerebrium Integration (#6510) milo157 2024-07-17 13:22:53 -04:00
  • 2fa4623d9e [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) Cody Yu 2024-07-17 09:37:16 -07:00
  • a9a2e74d21 [Misc] Use torch.Tensor for type annotation (#6505) Woosuk Kwon 2024-07-17 06:01:10 -07:00
  • e09ce759aa [TPU] Remove multi-modal args in TPU backend (#6504) Woosuk Kwon 2024-07-17 04:02:53 -07:00
  • 5fa6e9876e [Bugfix] Fix for multinode crash on 4 PP (#6495) Murali Andoorveedu 2024-07-17 04:25:10 -04:00
  • 5bf35a91e4 [Doc][CI/Build] Update docs and tests to use vllm serve (#6431) Cyrus Leung 2024-07-17 15:43:21 +08:00
  • a19e8d3726 [Misc][Speculative decoding] Typos and typing fixes (#6467) shangmingc 2024-07-17 15:17:07 +08:00
  • 10383887e0 [ROCm] Cleanup Dockerfile and remove outdated patch (#6482) Hongxia Yang 2024-07-17 01:47:02 -04:00
  • 1d094fd7c0 [Distributed][PP] only create embedding & lm head when necessary (#6455) Wushi Dong 2024-07-16 19:20:26 -07:00
  • ce37be7ba0 [misc][distributed] add seed to dummy weights (#6491) youkaichao 2024-07-16 19:16:34 -07:00
  • 7f62077af5 [misc][distributed] improve tests (#6488) youkaichao 2024-07-16 17:35:52 -07:00
  • 09c2eb85dd [ci][distributed] add pipeline parallel correctness test (#6410) youkaichao 2024-07-16 15:44:22 -07:00
  • 978aed5300 [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) Michael Goin 2024-07-16 18:31:32 -04:00
  • 160e1d8c99 [Misc] Log spec decode metrics (#6454) Cody Yu 2024-07-16 13:37:10 -07:00
  • 94162beb9f [Doc] Fix the lora adapter path in server startup script (#6230) Jiaxin Shan 2024-07-16 10:11:04 -07:00
  • c467dff24f [Hardware][TPU] Support MoE with Pallas GMM kernel (#6457) Woosuk Kwon 2024-07-16 09:56:28 -07:00
  • 9f4ccec761 [doc][misc] remind to cancel debugging environment variables (#6481) youkaichao 2024-07-16 09:45:30 -07:00
  • 38ef94888a [CI/Build] Remove "boardwalk" image asset (#6460) Cyrus Leung 2024-07-16 23:59:36 +08:00
  • 2bb0489cb3 [Core] Use numpy to speed up padded token processing (#6442) Peng Guanwen 2024-07-16 23:13:25 +08:00
  • 7508a3dc34 [Misc] Fix typos in spec. decode metrics logging. (#6470) Thomas Parnell 2024-07-16 15:55:15 +02:00
  • 7a3d2a5b95 [Frontend] Support for chat completions input in the tokenize endpoint (#5923) sasha0552 2024-07-16 12:18:09 +00:00
  • d97011512e [CI/Build] vLLM cache directory for images (#6444) Cyrus Leung 2024-07-16 14:12:25 +08:00
  • 37d776606f [Docs] Announce 5th meetup (#6458) Woosuk Kwon 2024-07-15 21:04:58 -07:00
  • d92b3c5cde [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419) Joe 2024-07-15 18:54:15 -07:00
  • 9ad32dacd9 [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) Mor Zusman 2024-07-16 04:32:55 +03:00
  • d6f3b3d5c4 Pin sphinx-argparse version (#6453) Kevin H. Luu 2024-07-15 18:26:11 -07:00
  • 4552e37b55 [CI/Build][TPU] Add TPU CI test (#6277) Woosuk Kwon 2024-07-15 14:31:16 -07:00
  • ec9933f4a5 [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289) Woosuk Kwon 2024-07-15 12:02:14 -07:00
  • 3dee97b05f [Docs] Add Google Cloud to sponsor list (#6450) Woosuk Kwon 2024-07-15 11:58:10 -07:00
  • 4cf256ae7f [misc][distributed] fix pp missing layer condition (#6446) (tag: v0.5.2) youkaichao 2024-07-15 10:32:35 -07:00
  • 64fdc08c72 bump version to v0.5.2 (#6433) Simon Mo 2024-07-15 10:27:40 -07:00
  • 4ef95b0f06 [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409) Thomas Parnell 2024-07-15 19:14:49 +02:00
  • eaec4b9153 [Bugfix] Add custom Triton cache manager to resolve MoE MP issue (#6140) Thomas Parnell 2024-07-15 19:12:47 +02:00
  • a63a4c6341 [Misc] Use 0.0.9 version for flashinfer (#6447) Pernekhan Utemuratov 2024-07-15 10:10:26 -07:00
  • c8fd97f26d [Kernel] Use CUTLASS kernels for the FP8 layers with Bias (#6270) Tyler Michael Smith 2024-07-15 13:05:52 -04:00
  • 94b82e8c18 [doc][distributed] add suggestion for distributed inference (#6418) youkaichao 2024-07-15 09:45:51 -07:00
  • 6ae1597ddf [VLM] Minor space optimization for ClipVisionModel (#6436) Roger Wang 2024-07-15 02:29:51 -07:00
  • 22e79ee8f3 [doc][misc] doc update (#6439) youkaichao 2024-07-14 23:33:25 -07:00
  • de19916314 [Bugfix] Convert image to RGB by default (#6430) Cyrus Leung 2024-07-15 13:39:15 +08:00
  • 69672f116c [core][distributed] simplify code to support pipeline parallel (#6406) youkaichao 2024-07-14 21:20:51 -07:00
  • 44874a0bf9 [Doc] add env docs for flashinfer backend (#6437) DefTruth 2024-07-15 12:16:51 +08:00
  • b47008b4d2 [BugFix] BatchResponseData body should be optional (#6345) zifeitong 2024-07-14 21:06:09 -07:00
  • 9bfece89fd Add FUNDING.yml (#6435) Simon Mo 2024-07-14 20:36:16 -07:00
  • 32c9d7f765 Report usage for beam search (#6404) Simon Mo 2024-07-14 19:37:35 -07:00
  • ccb20db8bd [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' (#6428) Fish 2024-07-15 10:27:01 +08:00
  • a754dc2cb9 [CI/Build] Cross python wheel (#6394) Robert Shaw 2024-07-14 21:54:46 -04:00
  • 61e85dbad8 [Doc] xpu backend requires running setvars.sh (#6393) Robert Cohn 2024-07-14 20:10:11 -04:00
  • dbfe254eda [Feature] vLLM CLI (#5090) Ethan Xu 2024-07-14 15:36:43 -07:00
  • 73030b7dae [ Misc ] Enable Quantizing All Layers of DeekSeekv2 (#6423) Robert Shaw 2024-07-14 17:38:42 -04:00
  • ccd3c04571 [ci][build] fix commit id (#6420) youkaichao 2024-07-14 07:16:21 -07:00
  • 9dad5cc859 [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace (#6384) Tyler Michael Smith 2024-07-14 09:37:19 -04:00
  • 6ef3bf912c Remove unnecessary trailing period in spec_decode.rst (#6405) Yuan Tang 2024-07-14 02:58:09 -05:00