diff --git a/README.md b/README.md
index 7457dc97..87ac4bc5 100644
--- a/README.md
+++ b/README.md
@@ -196,23 +196,52 @@ dsv4/
 ### The non-negotiables
 
 - **NEVER edit on the B200.** Always: edit locally → commit → push → pull on B200 → test.
-- **ALWAYS use the test harness** (`fire_b200_test`, `run_test.sh`, `check_log.sh`). Never raw SSH+nohup. Nohup does not survive SSH drops; screen sessions do.
+- **NEVER raw SSH + direct command.** Always use the test harness scripts. They handle: killing hung processes, deleting stale logs, screen sessions that survive SSH drops, timeouts for hung kernels, and GPU cleanup.
 - **ALWAYS verify hd=64 regression** (cos ~0.999998) after every FMHA change. If it regresses, the change is wrong. Revert.
 - **NEVER touch drivers, kernels, firmware, or system packages** on the B200.
 - **NEVER delete test files** in `tests/unit/` without explicit approval.
 
-### Local → B200 cycle
+### Two harnesses: Python and CUDA
+
+| Harness | For | Script | Screen name | Log file |
+|---|---|---|---|---|
+| Python | `test_*.py` files | `fire_b200_test` | `kernel-test` | `/tmp/kernel-test.log` |
+| CUDA | `test_*.cu` files | `fire_b200_cuda_test` | `cuda-test` | `/tmp/cuda-test.log` |
+
+Both harnesses follow the same discipline:
+1. **Kill everything first** — old screen sessions, hanging GPU processes, stale binaries
+2. **Delete all logs** — never debug from a previous run's log
+3. **Clean git + pull** — no uncommitted B200 state
+4. **Run in screen** — survives SSH drops, has a timeout
+5. **One test at a time** — no parallel launches, ever
+
+### Python test (one command)
 
 ```bash
-# 1. Edit locally, commit, push
-cd ~/dev/nvfp4-megamoe-kernel
-git add -A && git commit -m "description" && git push origin master
-
-# 2. One-command test (auto-pushes, runs, dumps log)
+# From local machine — auto-pushes, runs, polls, dumps log
 ~/.openclaw/workspace/fire_b200_test tests/unit/test_fmha_v3_stage_c.py
 ```
 
-### Manual B200 cycle
+### CUDA test (one command)
+
+```bash
+# From local machine — compiles with nvcc, runs, polls, dumps log
+# Default timeout: 60s. Pass a second arg for custom timeout.
+~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_fmha_sm100_standalone.cu
+~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_tmem_minimal.cu 30
+```
+
+### Check on a running CUDA test
+
+```bash
+# Show current log + screen status
+~/.openclaw/workspace/check_b200_cuda
+
+# Kill a hung test + show the log
+~/.openclaw/workspace/check_b200_cuda kill
+```
+
+### Manual B200 cycle (emergency only)
 
 ```bash
 ssh root@
diff --git a/tests/run_cuda_test.sh b/tests/run_cuda_test.sh
deleted file mode 100755
index 911de345..00000000
--- a/tests/run_cuda_test.sh
+++ /dev/null
@@ -1,75 +0,0 @@
-#!/bin/bash
-# Run the standalone CUDA FMHA test on the B200 using screen sessions.
-# Follows the test harness pattern from README.md.
-# Usage: bash tests/run_cuda_test.sh
-set -e
-
-B200="root@45.76.247.107"
-SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=4"
-PASS='6)Jr)B@dcX[mN?dx'
-REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"
-VENV="/root/dsv4-nvfp4-workspace/venv/bin/activate"
-
-echo "=== Running CUDA test on B200 via screen harness ==="
-
-sshpass -p "$PASS" ssh $SSH_OPTS $B200 bash -s <<'REMOTE_SCRIPT'
-set -e
-REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"
-
-# --- CLEANUP (same as run_test.sh) ---
-if screen -list 2>/dev/null | grep -q kernel-test; then
-    session_pid=$(screen -ls | grep kernel-test | grep -o '[0-9]*' | head -1)
-    if [ -n "$session_pid" ]; then
-        pkill -9 -P "$session_pid" 2>/dev/null || true
-    fi
-    screen -S kernel-test -X quit 2>/dev/null || true
-fi
-pkill -9 -f test_fmha 2>/dev/null || true
-pkill -9 -f test_tmem 2>/dev/null || true
-sleep 2
-
-# Delete old log
-rm -f /tmp/kernel-test.log
-
-# --- PULL ---
-cd $REPO_DIR
-git checkout -- . 2>/dev/null || true
-git clean -fd 2>/dev/null || true
-git pull
-
-# --- COMPILE + RUN in screen ---
-export PATH=/usr/local/cuda-13.2/bin:$PATH
-
-# Compile
-nvcc -std=c++20 -gencode=arch=compute_100a,code=sm_100a \
-    -I$REPO_DIR \
-    $REPO_DIR/tests/unit/test_fmha_sm100_standalone.cu \
-    -o /tmp/test_fmha_sm100 -lcudart 2>&1 | tee /tmp/kernel-test.log
-
-echo "" >> /tmp/kernel-test.log
-echo "=== Running test ===" >> /tmp/kernel-test.log
-
-# Run in screen (survives SSH drops)
-screen -dmS kernel-test bash -c "timeout 60 /tmp/test_fmha_sm100 >> /tmp/kernel-test.log 2>&1; echo 'EXIT_CODE=$?' >> /tmp/kernel-test.log"
-sleep 3
-
-if screen -list | grep -q kernel-test; then
-    echo "OK: screen kernel-test is running"
-else
-    echo "FAIL: screen did not start. Log below:"
-    cat /tmp/kernel-test.log 2>/dev/null
-    exit 1
-fi
-REMOTE_SCRIPT
-
-echo "=== Test launched. Polling for results... ==="
-while true; do
-    RESULT=$(sshpass -p "$PASS" ssh $SSH_OPTS $B200 "screen -list 2>/dev/null | grep -q kernel-test && echo running || echo done" 2>/dev/null || echo "done")
-    if [ "$RESULT" != "running" ]; then
-        echo "=== Screen finished. Results: ==="
-        sshpass -p "$PASS" ssh $SSH_OPTS $B200 "cat /tmp/kernel-test.log"
-        exit 0
-    fi
-    echo " ...still running..."
-    sleep 10
-done
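
Review note on the removed script: its `screen -dmS ... bash -c "...; echo 'EXIT_CODE=$?' ..."` line places `$?` inside a double-quoted string, so the launching shell expands it before the test runs and the log never records the test's real exit status. The replacement `fire_b200_cuda_test` is not part of this diff, so the following is only a minimal sketch of the "fresh log, timeout, recorded exit code" steps. The name `run_logged` is hypothetical, and `screen` is left out so the sketch runs on any machine with GNU `timeout`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): run one command under a timeout,
# writing to a fresh log and recording the command's real exit status.
run_logged() {
    local log="$1" secs="$2"
    shift 2
    rm -f "$log"    # never reuse a previous run's log
    # The inner script is single-quoted, so $? is expanded by the inner bash
    # after the command exits, not by the shell that launches it. The log
    # path, timeout, and command are passed as positional arguments.
    bash -c 'log="$1"; secs="$2"; shift 2
             timeout "$secs" "$@" >> "$log" 2>&1
             echo "EXIT_CODE=$?" >> "$log"' _ "$log" "$secs" "$@"
}
```

For example, `run_logged /tmp/demo.log 1 sleep 5` leaves `EXIT_CODE=124` in the log, the status GNU `timeout` reports when it kills a command. The same single-quoted `bash -c '...' _ args` form could presumably be placed after `screen -dmS cuda-test` to keep the session behaviour the README describes.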