1674 Commits

Author SHA1 Message Date
youkaichao
d9a252bc8e [Core][Distributed] add shm broadcast (#5399)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
Jee Li
67005a07bc [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Chang Su
c35e4a3dd7 [BugFix] Fix test_phi3v.py (#5725) 2024-06-21 04:45:34 +00:00
Jinzhen Lin
1f5674218f [Kernel] Add punica dimension for Qwen2 LoRA (#5441) 2024-06-20 17:55:41 -07:00
Joshua Rosenkranz
b12518d3cf [Model] MLPSpeculator speculative decoding support (#4947)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
youkaichao
6c5b7af152 [distributed][misc] use fork by default for mp (#5669) 2024-06-20 17:06:34 -07:00
Michael Goin
8065a7e220 [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 2024-06-20 17:00:13 -06:00
Tyler Michael Smith
3f3b6b2150 [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) 2024-06-20 18:36:10 +00:00
Varun Sundar Rabindranath
a7dcc62086 [Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 13:33:21 +00:00
Roger Wang
ad137cd111 [Model] Port over CLIPVisionModel for VLMs (#5591) 2024-06-20 11:52:09 +00:00
Varun Sundar Rabindranath
111af1fa2c [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 06:37:08 +00:00
Roger Wang
1b2eaac316 [Bugfix][Doc] Fix Duplicate Explicit Target Name Errors (#5703) 2024-06-19 23:10:47 -07:00
Cyrus Leung
3730a1c832 [Misc] Improve conftest (#5681) 2024-06-19 19:09:21 -07:00
Kevin H. Luu
949e49a685 [ci] Limit num gpus if specified for A100 (#5694)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-19 16:30:03 -07:00
Dipika Sikka
4a30d7e3cc [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) 2024-06-19 18:06:44 -04:00
Rafael Vasquez
e83db9e7e3 [Doc] Update docker references (#5614)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-06-19 15:01:45 -07:00
zifeitong
78687504f7 [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) 2024-06-19 13:57:12 -07:00
youkaichao
d571ca0108 [ci][distributed] add tests for custom allreduce (#5689) 2024-06-19 20:16:04 +00:00
Michael Goin
afed90a034 [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) 2024-06-19 14:41:42 -04:00
Kevin H. Luu
3ee5c4bca5 [ci] Add A100 queue into AWS CI template (#5648)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-19 08:42:13 -06:00
Cyrus Leung
e9c2732b97 [CI/Build] Add tqdm to dependencies (#5680) 2024-06-19 08:37:33 -06:00
DearPlanet
d8714530d1 [Misc]Add param max-model-len in benchmark_latency.py (#5629) 2024-06-19 18:19:08 +08:00
Isotr0py
7d46c8d378 [Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684) 2024-06-19 17:58:32 +08:00
Michael Goin
da971ec7a5 [Model] Add FP8 kv cache for Qwen2 (#5656) 2024-06-19 09:38:26 +00:00
youkaichao
3eea74889f [misc][distributed] use 127.0.0.1 for single-node (#5619) 2024-06-19 08:05:00 +00:00
Hongxia Yang
f758aed0e8 [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices (#5641) 2024-06-18 23:21:29 -07:00
Thomas Parnell
e5150f2c28 [Bugfix] Added test for sampling repetition penalty bug. (#5659)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-19 06:03:55 +00:00
Shukant Pal
59a1eb59c9 [Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628) 2024-06-19 01:46:38 +00:00
Tyler Michael Smith
6820724e51 [Bugfix] Fix w8a8 benchmarks for int8 case (#5643) 2024-06-19 00:33:25 +00:00
Tyler Michael Smith
b23ce92032 [Bugfix] Fix CUDA version check for mma warning suppression (#5642) 2024-06-18 23:48:49 +00:00
milo157
2bd231a7b7 [Doc] Added cerebrium as Integration option (#5553) 2024-06-18 15:56:59 -07:00
Thomas Parnell
8a173382c8 [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (#5639)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-18 14:18:37 -07:00
sergey-tinkoff
07feecde1a [Model] LoRA support added for command-r (#5178) 2024-06-18 11:01:21 -07:00
Kevin H. Luu
19091efc44 [ci] Setup Release pipeline and build release wheels with cache (#5610)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-18 11:00:36 -07:00
Dipika Sikka
95db455e7f [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) 2024-06-18 12:45:05 -04:00
Ronen Schaffer
7879f24dcc [Misc] Add OpenTelemetry support (#4687)
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown document with screenshots to guide users on how to use this feature.
2024-06-19 01:17:03 +09:00
Kevin H. Luu
13db4369d9 [ci] Deprecate original CI template (#5624)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-18 14:26:20 +00:00
Roger Wang
4ad7b53e59 [CI/Build][Misc] Update Pytest Marker for VLMs (#5623) 2024-06-18 13:10:04 +00:00
Chang Su
f0cc0e68e3 [Misc] Remove import from transformers logging (#5625) 2024-06-18 12:12:19 +00:00
youkaichao
db5ec52ad7 [bugfix][distributed] improve p2p capability test (#5612)
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612)
2024-06-18 07:21:05 +00:00
Kuntai Du
114d7270ff [CI] Avoid naming different metrics with the same name in performance benchmark (#5615) 2024-06-17 21:37:18 -07:00
Cyrus Leung
32c86e494a [Misc] Fix typo (#5618) 2024-06-17 20:58:30 -07:00
youkaichao
8eadcf0b90 [misc][typo] fix typo (#5620) 2024-06-17 20:54:57 -07:00
Joe Runde
5002175e80 [Kernel] Add punica dimensions for Granite 13b (#5559)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Isotr0py
daef218b55 [Model] Initialize Phi-3-vision support (#4986) 2024-06-17 19:34:33 -07:00
sroy745
fa9e385229 [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131) 2024-06-17 21:29:09 -05:00
zifeitong
26e1188e51 [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606) 2024-06-17 23:16:10 +00:00
Bruce Fontaine
a3e8a05d4c [Bugfix] Fix KV head calculation for MPT models when using GQA (#5142) 2024-06-17 15:26:41 -07:00
youkaichao
e441bad674 [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584) 2024-06-17 22:08:05 +00:00
youkaichao
1b44aaf4e3 [bugfix][distributed] fix 16 gpus local rank arrangement (#5604) 2024-06-17 21:35:04 +00:00