151 Commits

Author SHA1 Message Date
youkaichao
05826c887b [misc] fix custom allreduce p2p cache file generation (#7853) 2024-08-26 15:02:25 -07:00
youkaichao
d95cc0a55c [core][misc] update libcudart finding (#7620)
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com>
2024-08-16 23:01:35 -07:00
bnellnm
e680349994 [Bugfix] Fix custom_ar support check (#7617) 2024-08-16 19:05:49 -07:00
Woosuk Kwon
59edd0f134 [Bugfix][CI] Import ray under guard (#7486) 2024-08-13 17:12:58 -07:00
Woosuk Kwon
a08df8322e [TPU] Support multi-host inference (#7457) 2024-08-13 16:31:20 -07:00
youkaichao
639159b2a6 [distributed][misc] add specialized method for cuda platform (#7249) 2024-08-07 08:54:52 -07:00
youkaichao
16a1cc9bb2 [misc][distributed] improve libcudart.so finding (#7127) 2024-08-04 11:31:51 -07:00
Woosuk Kwon
fad5576c58 [TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856) 2024-07-27 10:28:33 -07:00
Woosuk Kwon
d09b94ca58 [TPU] Support collective communications in XLA devices (#6813) 2024-07-27 01:45:57 +00:00
youkaichao
740374d456 [core][distributed] fix zmq hang (#6759) 2024-07-24 17:37:12 -07:00
Woosuk Kwon
b6df37f943 [Misc] Remove abused noqa (#6619) 2024-07-21 23:47:04 +08:00
youkaichao
07eb6f19f3 [bugfix][distributed] fix multi-node bug for shared memory (#6597) 2024-07-19 15:34:34 -07:00
Nick Hill
d25877dd9b [BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530) 2024-07-17 22:24:43 -07:00
youkaichao
7f62077af5 [misc][distributed] improve tests (#6488) 2024-07-16 17:35:52 -07:00
Cyrus Leung
d97011512e [CI/Build] vLLM cache directory for images (#6444) 2024-07-15 23:12:25 -07:00
youkaichao
2b0fb53481 [distributed][misc] be consistent with pytorch for libcudart.so (#6346)
[distributed][misc] keep consistent with how pytorch finds libcudart.so (#6346)
2024-07-11 19:35:17 -07:00
youkaichao
da78caecfa [core][distributed] zmq fallback for broadcasting large objects (#6183)
[core][distributed] add zmq fallback for broadcasting large objects (#6183)
2024-07-09 18:49:11 -07:00
youkaichao
614aa51203 [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) 2024-06-30 20:07:34 -07:00
Cyrus Leung
99397da534 [CI/Build] Add TP test for vision models (#5892) 2024-06-29 15:45:54 +00:00
youkaichao
515080ad2f [bugfix][distributed] fix shm broadcast when the queue size is full (#5801) 2024-06-25 21:56:02 -07:00
Matt Wong
dd793d1de5 [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) 2024-06-25 15:56:15 -07:00
Murali Andoorveedu
5d4d90536f [Distributed] Add send and recv helpers (#5719) 2024-06-23 14:42:28 -07:00
youkaichao
832ea88fcb [core][distributed] improve shared memory broadcast (#5754) 2024-06-22 10:00:43 -07:00
youkaichao
d9a252bc8e [Core][Distributed] add shm broadcast (#5399)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
youkaichao
6c5b7af152 [distributed][misc] use fork by default for mp (#5669) 2024-06-20 17:06:34 -07:00
youkaichao
db5ec52ad7 [bugfix][distributed] improve p2p capability test (#5612)
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612)
2024-06-18 07:21:05 +00:00
Cyrus Leung
0e9164b40a [mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
youkaichao
f5bb85b435 [Core][Distributed] improve p2p cache generation (#5528) 2024-06-14 14:47:45 -07:00
Antoni Baum
50eed24d25 Add cuda_device_count_stateless (#5473)
2024-06-13 16:06:49 -07:00
youkaichao
ea3890a5f0 [Core][Distributed] code deduplication in tp&pp with coordinator(#5293)
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
2024-06-12 17:27:08 -07:00
youkaichao
c4bd03c7c5 [Core][Distributed] add same-node detection (#5369) 2024-06-11 10:53:59 -07:00
youkaichao
c81da5f56d [misc][typo] fix typo (#5372) 2024-06-10 09:51:02 +00:00
bnellnm
5467ac3196 [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
youkaichao
594392d27a [Core][Distributed] improve p2p access check (#4992) 2024-05-29 11:29:07 +00:00
youkaichao
5bd3c65072 [Core][Optimization] remove vllm-nccl (#5091) 2024-05-29 05:13:52 +00:00
Murali Andoorveedu
5eda2ea02a [Core][1/N] Support send/recv in PyNCCL Groups (#4988)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-05-23 09:54:48 -07:00
youkaichao
702bee461f [Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) 2024-05-12 17:47:59 -07:00
youkaichao
4e12131089 [Core][Test] fix function name typo in custom allreduce (#4750) 2024-05-10 15:14:40 -07:00
youkaichao
208b71bcc1 [Core][Distributed] refactor pynccl (#4591)
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
2024-05-09 19:48:43 -07:00
youkaichao
5b8a7c1cb0 [Misc] centralize all usage of environment variables (#4548) 2024-05-02 11:13:25 -07:00
youkaichao
2a85f93007 [Core][Distributed] enable multiple tp group (#4512)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-02 04:28:21 +00:00
youkaichao
6ef09b08f8 [Core][Distributed] fix pynccl del error (#4508) 2024-05-01 15:23:06 -07:00
SangBin Cho
a88081bf76 [CI] Disable non-lazy string operation on logging (#4326)
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00
youkaichao
3cd9b5bb2d [Core][Distributed] use existing torch.cuda.device (#4318)
[Core][Distributed] use existing torch.cuda.device context manager (#4318)
2024-04-24 09:00:20 -07:00
youkaichao
91f50a6fe2 [Core][Distributed] use cpu/gloo to initialize pynccl (#4248) 2024-04-23 18:32:19 -07:00
SangBin Cho
0ae11f78ab [Mypy] Part 3 fix typing for nested directories for most of directory (#4161) 2024-04-22 21:32:44 -07:00
youkaichao
747b1a7147 [Core][Distributed] fix _is_full_nvlink detection (#4233) 2024-04-21 23:04:16 -07:00
Adam Tilghman
8f9c28fd40 [Bugfix] Fix CustomAllreduce nvlink topology detection (#3974)
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) (#4159)
2024-04-18 15:32:47 -07:00
youkaichao
6dc1fc9cfe [Core] nccl integrity check and test (#4155)
[Core] Add integrity check during initialization; add test for it (#4155)
2024-04-17 22:28:52 -07:00
youkaichao
2cd6b4f362 [Core] avoid too many cuda context by caching p2p test (#4021) 2024-04-13 23:40:21 -07:00