[CI/Build] Add markdown linter (#11857)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Author: Rafael Vasquez
Date: 2025-01-12 03:17:13 -05:00
Committed by: GitHub
parent b25cfab9a0
commit 43f3d9e699
49 changed files with 585 additions and 560 deletions


@@ -48,6 +48,7 @@ If vLLM crashes and the error trace captures it somewhere around `self.graph.rep
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class, to disable the CUDA graph optimization and isolate the exact failing CUDA operation.
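For example, a minimal sketch of disabling CUDA graphs from Python (the model name here is only a placeholder):
```python
from vllm import LLM

# With CUDA graph capture disabled, the failing CUDA op shows up
# directly in the stack trace instead of inside graph replay.
# "facebook/opt-125m" is just a placeholder model for illustration.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
```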
(troubleshooting-incorrect-hardware-driver)=
## Incorrect hardware/driver
If GPU/CPU communication cannot be established, run the following Python script and follow the instructions below to confirm that GPU/CPU communication is working correctly.
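The script body falls outside the hunks shown in this diff. As a rough sketch (assuming torchrun sets the usual rendezvous environment variables, and an all-reduce check over NCCL for GPUs and Gloo for CPUs; the exact script in the docs may differ), `test.py` could look like:

```python
# test.py -- distributed communication sanity check (sketch)
import torch
import torch.distributed as dist

# torchrun provides RANK, WORLD_SIZE, MASTER_ADDR, etc. in the environment.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# GPU communication check: all-reduce a tensor of ones over NCCL.
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
world_size = dist.get_world_size()
assert data.mean().item() == world_size, "NCCL all-reduce mismatch"

# CPU communication check: all-reduce over a Gloo group.
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.ones(128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
assert cpu_data.mean().item() == world_size, "Gloo all-reduce mismatch"

print("sanity check is successful!")
dist.destroy_process_group()
```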
@@ -118,13 +119,13 @@ dist.destroy_process_group()
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
```console
-$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
+NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
```
If you are testing with multiple nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
```console
-$ NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
+NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
```
If the script runs successfully, you should see the message `sanity check is successful!`.
@@ -141,6 +142,7 @@ Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup
```
(troubleshooting-python-multiprocessing)=
## Python multiprocessing
### `RuntimeError` Exception