Commit Graph

  • 9cc373f390 [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) Charlie Fu 2024-09-19 12:37:57 -05:00
  • 76515f303b [Frontend] Use MQLLMEngine for embeddings models too (#8584) Nick Hill 2024-09-19 17:51:06 +01:00
  • 855c8ae2c9 [MISC] remove engine_use_ray in benchmark_throughput.py (#8615) Kunshang Ji 2024-09-19 13:33:20 +08:00
  • c52ec5f034 [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616) Kuntai Du 2024-09-18 22:24:24 -07:00
  • 02c9afa2d0 Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593) Roger Wang 2024-09-18 21:14:28 -07:00
  • 3118f63385 [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545) sroy745 2024-09-18 19:24:15 -07:00
  • 4c34ce8916 [Kernel] Remove marlin moe templating on thread_m_blocks (#8573) Tyler Michael Smith 2024-09-18 21:42:49 -04:00
  • 0d47bf3bf4 [Bugfix] add dead_error property to engine client (#8574) Joe Runde 2024-09-18 16:10:01 -06:00
  • d9cd78eb71 [BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572) Nick Hill 2024-09-18 21:17:55 +01:00
  • db9120cded [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) Tyler Michael Smith 2024-09-18 16:05:06 -04:00
  • b3195bc9e4 [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380) Gregory Shtrasberg 2024-09-18 13:41:08 -04:00
  • e18749ff09 [Model] Support Solar Model (#8386) Geun, Lim 2024-09-19 02:04:00 +09:00
  • d65798f78c [Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543) Russell Bryant 2024-09-18 12:10:27 -04:00
  • a8c1d161a7 [Core] *Prompt* logprobs support in Multi-step (#8199) afeldman-nm 2024-09-18 11:38:43 -04:00
  • 7c7714d856 [Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157) Alexander Matveev 2024-09-18 09:56:58 -04:00
  • 9d104b5beb [CI/Build] Update Ruff version (#8469) Aaron Pham 2024-09-18 07:00:56 -04:00
  • 6ffa3f314c [CI/Build] Avoid CUDA initialization (#8534) Cyrus Leung 2024-09-18 18:38:11 +08:00
  • e351572900 [Misc] Add argument to disable FastAPI docs (#8554) Jiaxin Shan 2024-09-18 02:51:59 -07:00
  • 95965d31b6 [CI/Build] fix Dockerfile.cpu on podman (#8540) Daniele 2024-09-18 04:49:53 +02:00
  • 8110e44529 [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) Tyler Michael Smith 2024-09-17 19:44:27 -04:00
  • 09deb4721f [CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520) Alexey Kondratiev(AMD) 2024-09-17 19:40:29 -04:00
  • fa0c114fad [doc] improve installation doc (#8550) youkaichao 2024-09-17 16:24:06 -07:00
  • 98f9713399 [Bugfix] Fix TP > 1 for new granite (#8544) Joe Runde 2024-09-17 17:17:08 -06:00
  • 56c3de018c [Misc] Don't dump contents of kvcache tensors on errors (#8527) Nick Hill 2024-09-17 20:24:29 +01:00
  • a54ed80249 [Model] Add mistral function calling format to all models loaded with "mistral" format (#8515) Patrick von Platen 2024-09-17 19:50:37 +02:00
  • 9855b99502 [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) chenqianfzh 2024-09-17 08:09:12 -07:00
  • 1009e93c5d [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) sroy745 2024-09-17 07:35:01 -07:00
  • 1b6de8352b [Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495) Isotr0py 2024-09-17 15:34:27 +08:00
  • cbdb252259 [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (#8509) Rui Qiao 2024-09-17 00:06:26 -07:00
  • 99aa4eddaf [torch.compile] register allreduce operations as custom ops (#8526) youkaichao 2024-09-16 22:57:57 -07:00
  • ee2bceaaa6 [Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521) Roger Wang 2024-09-16 22:22:45 -07:00
  • 1c1bb388e0 [Frontend] Improve Nullable kv Arg Parsing (#8525) Alex Brooks 2024-09-16 22:17:32 -06:00
  • 546034b466 [refactor] remove triton based sampler (#8524) Simon Mo 2024-09-16 20:04:48 -07:00
  • cca61642e0 [Bugfix] Fix 3.12 builds on main (#8510) Joe Runde 2024-09-16 18:01:45 -06:00
  • 5ce45eb54d [misc] small qol fixes for release process (#8517) Simon Mo 2024-09-16 15:11:27 -07:00
  • 5478c4b41f [perf bench] set timeout to debug hanging (#8516) Simon Mo 2024-09-16 14:30:02 -07:00
  • 47f5e03b5b [Bugfix] Bind api server port before starting engine (#8491) Kevin Lin 2024-09-16 15:56:28 -05:00
  • 2759a43a26 [doc] update doc on testing and debugging (#8514) youkaichao 2024-09-16 12:10:23 -07:00
  • 5d73ae49d6 [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) Luka Govedič 2024-09-16 14:52:40 -04:00
  • 781e3b9a42 [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506) sasha0552 2024-09-16 18:15:57 +00:00
  • acd5511b6d [BugFix] Fix clean shutdown issues (#8492) Nick Hill 2024-09-16 17:33:46 +01:00
  • 837c1968f9 [Frontend] Expose revision arg in OpenAI server (#8501) lewtun 2024-09-16 17:55:26 +02:00
  • a091e2da3e [Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032) ElizaWszola 2024-09-16 17:47:19 +02:00
  • fc990f9795 [Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357) Isotr0py 2024-09-16 06:51:44 +08:00
  • 3724d5f6b5 [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (#8490) Chris 2024-09-15 06:20:05 +02:00
  • 50e9ec41fc [TPU] Implement multi-step scheduling (#8489) Woosuk Kwon 2024-09-14 16:58:31 -07:00
  • 47790f3e32 [torch.compile] add a flag to disable custom op (#8488) youkaichao 2024-09-14 13:07:16 -07:00
  • a36e070dad [torch.compile] fix functionalization (#8480) youkaichao 2024-09-14 09:46:04 -07:00
  • 8a0cf1ddc3 [Model] support minicpm3 (#8297) ywfang 2024-09-14 22:50:26 +08:00
  • 1ef0d2efd0 [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) Charlie Fu 2024-09-13 19:01:11 -05:00
  • 851725202a [Hardware][intel GPU] bump up ipex version to 2.3 (#8365) Kunshang Ji 2024-09-14 07:54:34 +08:00
  • 9ba0817ff1 bump version to v0.6.1.post2 (#8473) (tag: v0.6.1.post2) Simon Mo 2024-09-13 11:35:00 -07:00
  • 18e9e1f7b3 [HotFix] Fix final output truncation with stop string + streaming (#8468) Nick Hill 2024-09-13 19:31:12 +01:00
  • f57092c00b [Doc] Add oneDNN installation to CPU backend documentation (#8467) Isotr0py 2024-09-14 02:06:30 +08:00
  • a84e598e21 [CI/Build] Reorganize models tests (#7820) Cyrus Leung 2024-09-14 01:20:06 +08:00
  • 0a4806f0a9 [plugin][torch.compile] allow to add custom compile backend (#8445) youkaichao 2024-09-13 09:32:42 -07:00
  • ecd7a1d5b6 [Installation] Gate FastAPI version for Python 3.8 (#8456) Cyrus Leung 2024-09-14 00:02:26 +08:00
  • a2469127db [misc][ci] fix quant test (#8449) youkaichao 2024-09-13 02:20:14 -07:00
  • 06311e2956 [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442) Jee Jee Li 2024-09-13 15:58:28 +08:00
  • cab69a15e4 [doc] recommend pip instead of conda (#8446) youkaichao 2024-09-12 23:52:41 -07:00
  • 9b4a3b235e [CI/Build] Enable InternVL2 PP test only on single node (#8437) Isotr0py 2024-09-13 14:35:20 +08:00
  • acda0b35d0 bump version to v0.6.1.post1 (#8440) (tag: v0.6.1.post1) Simon Mo 2024-09-12 21:39:49 -07:00
  • ba77527955 [bugfix] torch profiler bug for single gpu with GPUExecutor (#8354) William Lin 2024-09-12 21:30:00 -07:00
  • 6821020109 [Bugfix] Fix async log stats (#8417) Alexander Matveev 2024-09-12 23:48:59 -04:00
  • 8427550488 [CI/Build] Update pixtral tests to use JSON (#8436) Cyrus Leung 2024-09-13 11:47:52 +08:00
  • 3f79bc3d1a [Bugfix] Bump fastapi and pydantic version (#8435) Cyrus Leung 2024-09-13 11:21:42 +08:00
  • 40c396533d [Bugfix] Mapping physical device indices for e2e test utils (#8290) shangmingc 2024-09-13 11:06:28 +08:00
  • 5ec9c0fb3c [Core] Factor out input preprocessing to a separate class (#7329) Cyrus Leung 2024-09-13 10:56:13 +08:00
  • 8f44a92d85 [BugFix] fix group_topk (#8430) Dipika Sikka 2024-09-12 21:23:42 -04:00
  • 360ddbd37e [Misc] Update Pixtral example (#8431) Roger Wang 2024-09-12 17:31:18 -07:00
  • a480939e8e [Bugfix] Fix weight loading issue by rename variable. (#8293) Wenxiang 2024-09-13 07:25:00 +08:00
  • d31174a4e1 [Hotfix][Pixtral] Fix multiple images bugs (#8415) Patrick von Platen 2024-09-13 00:21:51 +02:00
  • b61bd98f90 [CI/Build] Disable multi-node test for InternVL2 (#8428) Roger Wang 2024-09-12 15:05:35 -07:00
  • c16369455f [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (#8425) Roger Wang 2024-09-12 14:06:51 -07:00
  • 019877253b [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427) Alexander Matveev 2024-09-12 17:01:50 -04:00
  • 551ce01078 [Core] Add engine option to return only deltas or final output (#7381) Nick Hill 2024-09-12 20:02:00 +01:00
  • a6c0f3658d [multi-step] add flashinfer backend (#7928) William Lin 2024-09-12 11:16:22 -07:00
  • f2e263b801 [Bugfix] Offline mode fix (#8376) Joe Runde 2024-09-12 12:11:57 -06:00
  • 1f0c75afa9 [BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (#8423) Luis Vega 2024-09-12 11:10:11 -07:00
  • 8a23e93302 [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance (#8403) WANGWEI 2024-09-13 01:47:42 +08:00
  • c6202daeed [Model] Support multiple images for qwen-vl (#8247) Alex Brooks 2024-09-12 11:10:54 -06:00
  • e56bf27741 [Bugfix] Fix InternVL2 inference with various num_patches (#8375) Isotr0py 2024-09-13 01:10:35 +08:00
  • 520ca380ae [Hotfix][VLM] Fixing max position embeddings for Pixtral (#8399) Roger Wang 2024-09-12 09:28:37 -07:00
  • 7de49aa86c [torch.compile] hide slicing under custom op for inductor (#8384) youkaichao 2024-09-12 00:11:55 -07:00
  • 42ffba11ad [Misc] Use RoPE cache for MRoPE (#8396) Woosuk Kwon 2024-09-11 23:13:14 -07:00
  • 295c4730a8 [Misc] Raise error when using encoder/decoder model with cpu backend (#8355) Kevin Lin 2024-09-12 00:45:24 -05:00
  • 1bf2dd9df0 [Gemma2] add bitsandbytes support for Gemma2 (#8338) Blueyo0 2024-09-12 12:53:12 +08:00
  • 5a60699c45 [Bugfix]: Fix the logic for deciding if tool parsing is used (#8366) tomeras91 2024-09-12 06:55:30 +03:00
  • b6c75e1cf2 Fix the AMD weight loading tests (#8390) Michael Goin 2024-09-11 23:35:33 -04:00
  • b71c956deb [TPU] Use Ray for default distributed backend (#8389) Woosuk Kwon 2024-09-11 20:31:51 -07:00
  • f842a7aff1 [misc] remove engine_use_ray (#8126) youkaichao 2024-09-11 18:23:36 -07:00
  • a65cb16067 [MISC] Dump model runner inputs when crashing (#8305) Cody Yu 2024-09-11 18:12:25 -07:00
  • 3fd2b0d21c Bump version to v0.6.1 (#8379) (tag: v0.6.1) Simon Mo 2024-09-11 14:42:11 -07:00
  • d394787e52 Pixtral (#8377) Patrick von Platen 2024-09-11 23:41:55 +02:00
  • 775f00f81e [Speculative Decoding] Test refactor (#8317) Lily Liu 2024-09-11 14:07:34 -07:00
  • 8baa454937 [Misc] Move device options to a single place (#8322) Aarni Koskela 2024-09-11 23:25:58 +03:00
  • 73202dbe77 [Kernel][Misc] register ops to prevent graph breaks (#6917) bnellnm 2024-09-11 15:52:19 -04:00
  • 7015417fd4 [Bugfix] Add missing attributes in mistral tokenizer (#8364) Cyrus Leung 2024-09-12 02:36:54 +08:00
  • aea02f30de [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation (#8373) Alexey Kondratiev(AMD) 2024-09-11 14:31:41 -04:00
  • 0b952af458 [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) Li, Jiang 2024-09-12 00:46:46 +08:00