harness: add fire_b200_cuda_test + check_b200_cuda, update README
Two new turnkey harness scripts for .cu tests:

- fire_b200_cuda_test: compile + run + poll; kills everything first, deletes old logs, runs one test at a time in a screen session, with a timeout
- check_b200_cuda: peek at a running test's log, or kill a hung test

README updated with CUDA harness documentation. Removed janky tests/run_cuda_test.sh.
45
README.md
@@ -196,23 +196,52 @@ dsv4/
### The non-negotiables

- **NEVER edit on the B200.** Always: edit locally → commit → push → pull on B200 → test.
- **ALWAYS use the test harness** (`fire_b200_test`, `run_test.sh`, `check_log.sh`). Never raw SSH+nohup. Nohup does not survive SSH drops; screen sessions do.
- **NEVER raw SSH + direct command.** Always use the test harness scripts. They handle: killing hung processes, deleting stale logs, screen sessions that survive SSH drops, timeouts for hung kernels, and GPU cleanup.
- **ALWAYS verify hd=64 regression** (cos ~0.999998) after every FMHA change. If it regresses, the change is wrong. Revert.
- **NEVER touch drivers, kernels, firmware, or system packages** on the B200.
- **NEVER delete test files** in `tests/unit/` without explicit approval.

### Local → B200 cycle
### Two harnesses: Python and CUDA

| Harness | For | Script | Screen name | Log file |
|---|---|---|---|---|
| Python | `test_*.py` files | `fire_b200_test` | `kernel-test` | `/tmp/kernel-test.log` |
| CUDA | `test_*.cu` files | `fire_b200_cuda_test` | `cuda-test` | `/tmp/cuda-test.log` |

Both harnesses follow the same discipline:
1. **Kill everything first** — old screen sessions, hanging GPU processes, stale binaries
2. **Delete all logs** — never debug from a previous run's log
3. **Clean git + pull** — no uncommitted B200 state
4. **Run in screen** — survives SSH drops, has a timeout
5. **One test at a time** — no parallel launches, ever
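
The five steps above can be sketched as a single function. This is a hypothetical illustration, not the contents of either harness script: the names `run`, `harness_cycle`, and `DRY_RUN` are assumptions, and GPU-process cleanup is left out. In dry-run mode (the default here) each command is printed instead of executed, so the ordering can be inspected safely.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the five-step discipline; NOT the real harness script.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Arguments (all illustrative):
#   session name, log path, repo path, timeout in seconds, test command
harness_cycle() {
    local session="$1" log="$2" repo="$3" timeout_s="$4" test_cmd="$5"

    # 1. Kill everything first: quit the old screen session, if any.
    run screen -S "$session" -X quit
    # 2. Delete the old log so a previous run is never mistaken for this one.
    run rm -f "$log"
    # 3. Clean git + pull: drop uncommitted state, then fetch the pushed commit.
    run git -C "$repo" checkout -- .
    run git -C "$repo" clean -fd
    run git -C "$repo" pull
    # 4 + 5. Run exactly one test in a detached screen session, under a timeout.
    run screen -dmS "$session" bash -c "timeout $timeout_s $test_cmd >> $log 2>&1"
}
```

For example, `harness_cycle cuda-test /tmp/cuda-test.log /path/to/repo 60 /tmp/test_bin` prints the six commands in order; the repo path and binary name here are placeholders.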

### Python test (one command)

```bash
# 1. Edit locally, commit, push
cd ~/dev/nvfp4-megamoe-kernel
git add -A && git commit -m "description" && git push origin master

# 2. One-command test (auto-pushes, runs, dumps log)
# From local machine — auto-pushes, runs, polls, dumps log
~/.openclaw/workspace/fire_b200_test tests/unit/test_fmha_v3_stage_c.py
```

### Manual B200 cycle
### CUDA test (one command)

```bash
# From local machine — compiles with nvcc, runs, polls, dumps log
# Default timeout: 60s. Pass a second arg for custom timeout.
~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_fmha_sm100_standalone.cu
~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_tmem_minimal.cu 30
```
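
The optional second argument could be handled roughly as follows. This is a sketch under assumptions: the function name `parse_cuda_test_args` and its validation rules are hypothetical, and the real `fire_b200_cuda_test` may parse its arguments differently. Only the 60-second default comes from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical argument handling; the real fire_b200_cuda_test may differ.
# Prints "<test_file> <timeout_seconds>" on success, returns 2 on bad input.
parse_cuda_test_args() {
    local test_file="${1:-}"
    local timeout_s="${2:-60}"   # documented default: 60 seconds

    if [ -z "$test_file" ]; then
        echo "usage: fire_b200_cuda_test <test.cu> [timeout_seconds]" >&2
        return 2
    fi

    # The CUDA harness is for .cu files; reject anything else early.
    case "$test_file" in
        *.cu) ;;
        *) echo "error: expected a .cu file, got '$test_file'" >&2; return 2 ;;
    esac

    # Accept only a whole number of seconds.
    case "$timeout_s" in
        *[!0-9]*) echo "error: timeout must be a whole number of seconds" >&2; return 2 ;;
    esac

    printf '%s %s\n' "$test_file" "$timeout_s"
}
```

Validating before any SSH connection is opened keeps a typo from costing a full kill, pull, and compile cycle on the B200.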

### Check on a running CUDA test

```bash
# Show current log + screen status
~/.openclaw/workspace/check_b200_cuda

# Kill a hung test + show the log
~/.openclaw/workspace/check_b200_cuda kill
```
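
The two modes shown above can be sketched as one function meant to run on the B200 side. Everything here is an assumption for illustration: the function name `check_cuda`, the `status` keyword, and the `SESSION`/`LOG` overrides are hypothetical, and the real `check_b200_cuda` presumably wraps something similar in an SSH call. The session name and log path defaults come from the harness table.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the peek/kill logic; NOT the real check_b200_cuda.
# SESSION and LOG are illustrative overrides with the README's defaults.
check_cuda() {
    local action="${1:-status}"
    local session="${SESSION:-cuda-test}"
    local log="${LOG:-/tmp/cuda-test.log}"

    case "$action" in
        status) ;;
        kill)
            # Quit the screen session; ignore the error if it is already gone.
            screen -S "$session" -X quit >/dev/null 2>&1 || true
            ;;
        *)
            echo "usage: check_cuda [status|kill]" >&2
            return 2
            ;;
    esac

    # Report whether the session is still listed by screen.
    if screen -list 2>/dev/null | grep -q "\.$session"; then
        echo "screen $session: running"
    else
        echo "screen $session: not running"
    fi

    # Always show the log, in both modes.
    if [ -f "$log" ]; then
        cat "$log"
    else
        echo "(no log at $log)"
    fi
}
```

Showing the log after a kill as well as on a plain check means the last output of a hung kernel is not lost.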

### Manual B200 cycle (emergency only)

```bash
ssh root@<B200>
@@ -1,75 +0,0 @@
#!/bin/bash
# Run the standalone CUDA FMHA test on the B200 using screen sessions.
# Follows the test harness pattern from README.md.
# Usage: bash tests/run_cuda_test.sh
set -e

B200="root@45.76.247.107"
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=4"
PASS='6)Jr)B@dcX[mN?dx'
REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"
VENV="/root/dsv4-nvfp4-workspace/venv/bin/activate"

echo "=== Running CUDA test on B200 via screen harness ==="

sshpass -p "$PASS" ssh $SSH_OPTS $B200 bash -s <<'REMOTE_SCRIPT'
set -e
REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"

# --- CLEANUP (same as run_test.sh) ---
if screen -list 2>/dev/null | grep -q kernel-test; then
    session_pid=$(screen -ls | grep kernel-test | grep -o '[0-9]*' | head -1)
    if [ -n "$session_pid" ]; then
        pkill -9 -P "$session_pid" 2>/dev/null || true
    fi
    screen -S kernel-test -X quit 2>/dev/null || true
fi
pkill -9 -f test_fmha 2>/dev/null || true
pkill -9 -f test_tmem 2>/dev/null || true
sleep 2

# Delete old log
rm -f /tmp/kernel-test.log

# --- PULL ---
cd $REPO_DIR
git checkout -- . 2>/dev/null || true
git clean -fd 2>/dev/null || true
git pull

# --- COMPILE + RUN in screen ---
export PATH=/usr/local/cuda-13.2/bin:$PATH

# Compile
nvcc -std=c++20 -gencode=arch=compute_100a,code=sm_100a \
    -I$REPO_DIR \
    $REPO_DIR/tests/unit/test_fmha_sm100_standalone.cu \
    -o /tmp/test_fmha_sm100 -lcudart 2>&1 | tee /tmp/kernel-test.log

echo "" >> /tmp/kernel-test.log
echo "=== Running test ===" >> /tmp/kernel-test.log

# Run in screen (survives SSH drops)
screen -dmS kernel-test bash -c "timeout 60 /tmp/test_fmha_sm100 >> /tmp/kernel-test.log 2>&1; echo 'EXIT_CODE=$?' >> /tmp/kernel-test.log"
sleep 3

if screen -list | grep -q kernel-test; then
    echo "OK: screen kernel-test is running"
else
    echo "FAIL: screen did not start. Log below:"
    cat /tmp/kernel-test.log 2>/dev/null
    exit 1
fi
REMOTE_SCRIPT

echo "=== Test launched. Polling for results... ==="
while true; do
    RESULT=$(sshpass -p "$PASS" ssh $SSH_OPTS $B200 "screen -list 2>/dev/null | grep -q kernel-test && echo running || echo done" 2>/dev/null || echo "done")
    if [ "$RESULT" != "running" ]; then
        echo "=== Screen finished. Results: ==="
        sshpass -p "$PASS" ssh $SSH_OPTS $B200 "cat /tmp/kernel-test.log"
        exit 0
    fi
    echo " ...still running..."
    sleep 10
done
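
One detail of the removed script is worth illustrating. Its `screen -dmS` command string is double-quoted, so `$?` is expanded by the shell that launches screen, before the test binary has run, and the log records the status of the preceding `echo` rather than the test's own exit code. A minimal sketch of one way to record the real status follows; the helper name `run_and_record` is hypothetical and is not claimed to be what the new harness scripts do.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the harness: runs a command under a timeout
# and appends its actual exit status to the log.
run_and_record() {
    local timeout_s="$1" log="$2"
    shift 2
    local rc=0
    # "|| rc=$?" captures a non-zero status without aborting under "set -e".
    timeout "$timeout_s" "$@" >> "$log" 2>&1 || rc=$?
    echo "EXIT_CODE=$rc" >> "$log"
}

# When the command is handed to screen as a string, single-quote it so that
# $? is evaluated by the inner shell after the test finishes. The binary and
# log paths below are placeholders:
#   screen -dmS cuda-test bash -c \
#     'timeout 60 /tmp/test_bin >> /tmp/cuda-test.log 2>&1; echo "EXIT_CODE=$?" >> /tmp/cuda-test.log'
```

With GNU `timeout`, a test that is killed for exceeding its limit is recorded with exit status 124, which distinguishes a hung kernel from an ordinary test failure.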