harness: add fire_b200_cuda_test + check_b200_cuda, update README

Two new turnkey harness scripts for .cu tests:
- fire_b200_cuda_test: compile+run+poll, kills everything first, deletes old logs, one test at a time, screen-based, timeout
- check_b200_cuda: peek at running test log, or kill hung test

README updated with CUDA harness documentation.
Removed janky tests/run_cuda_test.sh.
2026-05-28 07:36:10 +00:00
parent cec505ce14
commit 224d7e24c6
2 changed files with 37 additions and 83 deletions
--- a/README.md
+++ b/README.md
@@ -196,23 +196,52 @@ dsv4/
 ### The non-negotiables

 - **NEVER edit on the B200.** Always: edit locally → commit → push → pull on B200 → test.
- **ALWAYS use the test harness** (`fire_b200_test`, `run_test.sh`, `check_log.sh`). Never raw SSH+nohup. Nohup does not survive SSH drops; screen sessions do.
+- **NEVER raw SSH + direct command.** Always use the test harness scripts. They handle: killing hung processes, deleting stale logs, screen sessions that survive SSH drops, timeouts for hung kernels, and GPU cleanup.
 - **ALWAYS verify hd=64 regression** (cos ~0.999998) after every FMHA change. If it regresses, the change is wrong. Revert.
 - **NEVER touch drivers, kernels, firmware, or system packages** on the B200.
 - **NEVER delete test files** in `tests/unit/` without explicit approval.

-### Local → B200 cycle
+### Two harnesses: Python and CUDA
+
+| Harness | For | Script | Screen name | Log file |
+|---|---|---|---|---|
+| Python | `test_*.py` files | `fire_b200_test` | `kernel-test` | `/tmp/kernel-test.log` |
+| CUDA | `test_*.cu` files | `fire_b200_cuda_test` | `cuda-test` | `/tmp/cuda-test.log` |
+
+Both harnesses follow the same discipline:
+1. **Kill everything first** — old screen sessions, hanging GPU processes, stale binaries
+2. **Delete all logs** — never debug from a previous run's log
+3. **Clean git + pull** — no uncommitted B200 state
+4. **Run in screen** — survives SSH drops, has a timeout
+5. **One test at a time** — no parallel launches, ever
+
+### Python test (one command)

 ```bash
-# 1. Edit locally, commit, push
-cd ~/dev/nvfp4-megamoe-kernel
-git add -A && git commit -m "description" && git push origin master
-
-# 2. One-command test (auto-pushes, runs, dumps log)
+# From local machine — auto-pushes, runs, polls, dumps log
 ~/.openclaw/workspace/fire_b200_test tests/unit/test_fmha_v3_stage_c.py
 ```

-### Manual B200 cycle
+### CUDA test (one command)
+
+```bash
+# From local machine — compiles with nvcc, runs, polls, dumps log
+# Default timeout: 60s. Pass a second arg for custom timeout.
+~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_fmha_sm100_standalone.cu
+~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_tmem_minimal.cu 30
+```
+
+### Check on a running CUDA test
+
+```bash
+# Show current log + screen status
+~/.openclaw/workspace/check_b200_cuda
+
+# Kill a hung test + show the log
+~/.openclaw/workspace/check_b200_cuda kill
+```
+
+### Manual B200 cycle (emergency only)

 ```bash
 ssh root@<B200>
--- a/tests/run_cuda_test.sh
+++ b/tests/run_cuda_test.sh
@@ -1,75 +0,0 @@
-#!/bin/bash
-# Run the standalone CUDA FMHA test on the B200 using screen sessions.
-# Follows the test harness pattern from README.md.
-# Usage: bash tests/run_cuda_test.sh
-set -e
-
-B200="root@45.76.247.107"
-SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=4"
-PASS='6)Jr)B@dcX[mN?dx'
-REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"
-VENV="/root/dsv4-nvfp4-workspace/venv/bin/activate"
-
-echo "=== Running CUDA test on B200 via screen harness ==="
-
-sshpass -p "$PASS" ssh $SSH_OPTS $B200 bash -s <<'REMOTE_SCRIPT'
-set -e
-REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"
-
-# --- CLEANUP (same as run_test.sh) ---
-if screen -list 2>/dev/null | grep -q kernel-test; then
-    session_pid=$(screen -ls | grep kernel-test | grep -o '[0-9]*' | head -1)
-    if [ -n "$session_pid" ]; then
-        pkill -9 -P "$session_pid" 2>/dev/null || true
-    fi
-    screen -S kernel-test -X quit 2>/dev/null || true
-fi
-pkill -9 -f test_fmha 2>/dev/null || true
-pkill -9 -f test_tmem 2>/dev/null || true
-sleep 2
-
-# Delete old log
-rm -f /tmp/kernel-test.log
-
-# --- PULL ---
-cd $REPO_DIR
-git checkout -- . 2>/dev/null || true
-git clean -fd 2>/dev/null || true
-git pull
-
-# --- COMPILE + RUN in screen ---
-export PATH=/usr/local/cuda-13.2/bin:$PATH
-
-# Compile
-nvcc -std=c++20 -gencode=arch=compute_100a,code=sm_100a \
-    -I$REPO_DIR \
-    $REPO_DIR/tests/unit/test_fmha_sm100_standalone.cu \
-    -o /tmp/test_fmha_sm100 -lcudart 2>&1 | tee /tmp/kernel-test.log
-
-echo "" >> /tmp/kernel-test.log
-echo "=== Running test ===" >> /tmp/kernel-test.log
-
-# Run in screen (survives SSH drops)
-screen -dmS kernel-test bash -c "timeout 60 /tmp/test_fmha_sm100 >> /tmp/kernel-test.log 2>&1; echo 'EXIT_CODE=$?' >> /tmp/kernel-test.log"
-sleep 3
-
-if screen -list | grep -q kernel-test; then
-    echo "OK: screen kernel-test is running"
-else
-    echo "FAIL: screen did not start. Log below:"
-    cat /tmp/kernel-test.log 2>/dev/null
-    exit 1
-fi
-REMOTE_SCRIPT
-
-echo "=== Test launched. Polling for results... ==="
-while true; do
-    RESULT=$(sshpass -p "$PASS" ssh $SSH_OPTS $B200 "screen -list 2>/dev/null | grep -q kernel-test && echo running || echo done" 2>/dev/null || echo "done")
-    if [ "$RESULT" != "running" ]; then
-        echo "=== Screen finished. Results: ==="
-        sshpass -p "$PASS" ssh $SSH_OPTS $B200 "cat /tmp/kernel-test.log"
-        exit 0
-    fi
-    echo "  ...still running..."
-    sleep 10
-done
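
The five-step discipline in the README hunk (kill, delete logs, clean pull, run in screen, one test at a time) can be sketched as a generator for the remote command sequence. This is a minimal illustration, not the contents of `fire_b200_cuda_test`, which this diff does not show. The function name `build_remote_script` is hypothetical, `$REPO_DIR` is assumed to be set on the remote side, and the `nvcc` flags are copied from the removed `tests/run_cuda_test.sh`. The session name, log path, and 60s default timeout come from the README table and comments.

```bash
# Hypothetical sketch: prints the remote steps instead of running them,
# so it is safe to execute locally without a GPU or SSH access.
SESSION="cuda-test"          # screen name from the README table
LOG="/tmp/cuda-test.log"     # log file from the README table

build_remote_script() {
    src="$1"                             # path to a test_*.cu file
    timeout_s="${2:-60}"                 # README default: 60s
    bin="/tmp/$(basename "$src" .cu)"    # assumed output location

    cat <<EOF
# 1. Kill everything first
screen -S $SESSION -X quit 2>/dev/null || true
pkill -9 -f $bin 2>/dev/null || true
# 2. Delete all logs
rm -f $LOG
# 3. Clean git + pull
cd "\$REPO_DIR" && git checkout -- . && git clean -fd && git pull
# 4. Compile, then run in screen with a timeout (one session = one test)
nvcc -std=c++20 -gencode=arch=compute_100a,code=sm_100a -I"\$REPO_DIR" $src -o $bin > $LOG 2>&1
screen -dmS $SESSION bash -c 'timeout $timeout_s $bin >> $LOG 2>&1; echo "EXIT_CODE=\$?" >> $LOG'
EOF
}

build_remote_script tests/unit/test_tmem_minimal.cu 30
```

The printed text could be piped to `ssh <host> bash -s`, as the removed script did. One design detail: the `bash -c` payload is single-quoted so that `$?` is evaluated inside the screen session, after `timeout` exits. In the removed script the payload was double-quoted, so `$?` was expanded when the `screen` command line was parsed and did not reflect the test's exit status.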