[Feature] Add --distributed-timeout-seconds CLI option (#36047)
Signed-off-by: Shiyan Deng <dsy842974287@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
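As a usage sketch only (the model placeholder and the 1800-second value are illustrative, not taken from this commit), the new engine argument would be passed on the command line like any other:

    vllm serve <model> --distributed-timeout-seconds 1800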
@@ -234,9 +234,15 @@ class ParallelConfig:
     """distributed node rank for multi-node distributed
     inference when distributed_executor_backend is mp."""
     nnodes: int = 1
     """num of nodes for multi-node distributed
     inference when distributed_executor_backend is mp."""
 
+    distributed_timeout_seconds: int | None = None
+    """Timeout in seconds for distributed operations (e.g., init_process_group).
+    If set, this value is passed to torch.distributed.init_process_group as the
+    timeout parameter. If None, PyTorch's default timeout is used (600s for NCCL).
+    Increase this for multi-node setups where model downloads may be slow."""
+
     world_size: int = Field(init=False)
     """world_size is TPxPP, it affects the number of workers we create."""