harness: add fire_b200_cuda_test + check_b200_cuda, update README

Two new turnkey harness scripts for .cu tests:
- fire_b200_cuda_test: compile+run+poll, kills everything first,
  deletes old logs, one test at a time, screen-based, timeout
- check_b200_cuda: peek at running test log, or kill hung test

README updated with CUDA harness documentation.
Removed janky tests/run_cuda_test.sh.
2026-05-28 07:36:10 +00:00
parent cec505ce14
commit 224d7e24c6
2 changed files with 37 additions and 83 deletions


@@ -196,23 +196,52 @@ dsv4/
### The non-negotiables
- **NEVER edit on the B200.** Always: edit locally → commit → push → pull on B200 → test.
- **ALWAYS use the test harness** (`fire_b200_test`, `run_test.sh`, `check_log.sh`). Never raw SSH+nohup. Nohup does not survive SSH drops; screen sessions do.
- **NEVER raw SSH + direct command.** Always use the test harness scripts. They handle: killing hung processes, deleting stale logs, screen sessions that survive SSH drops, timeouts for hung kernels, and GPU cleanup.
- **ALWAYS verify hd=64 regression** (cos ~0.999998) after every FMHA change. If it regresses, the change is wrong. Revert.
- **NEVER touch drivers, kernels, firmware, or system packages** on the B200.
- **NEVER delete test files** in `tests/unit/` without explicit approval.
### Local → B200 cycle
### Two harnesses: Python and CUDA
| Harness | For | Script | Screen name | Log file |
|---|---|---|---|---|
| Python | `test_*.py` files | `fire_b200_test` | `kernel-test` | `/tmp/kernel-test.log` |
| CUDA | `test_*.cu` files | `fire_b200_cuda_test` | `cuda-test` | `/tmp/cuda-test.log` |
Both harnesses follow the same discipline:
1. **Kill everything first** — old screen sessions, hanging GPU processes, stale binaries
2. **Delete all logs** — never debug from a previous run's log
3. **Clean git + pull** — no uncommitted B200 state
4. **Run in screen** — survives SSH drops, has a timeout
5. **One test at a time** — no parallel launches, ever
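The five rules above can be sketched as a dry run that only prints the remote commands in order, without touching the B200. This is a hedged illustration, not the actual harness: the function name `harness_plan`, the `REPO_DIR` placeholder, and the `/tmp/test_bin` binary path are assumptions; the session name and log path come from the table.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: prints the commands a harness run would issue remotely.
# Nothing is executed against the B200. harness_plan, REPO_DIR, and
# /tmp/test_bin are illustrative names, not part of the real scripts.
harness_plan() {
    local session="$1" log="$2" timeout_s="$3" cmd="$4"
    local repo_dir="${REPO_DIR:-/path/to/repo}"

    # 1. Kill everything first: old screen session, then leftover processes.
    printf '%s\n' "screen -S $session -X quit || true"
    printf '%s\n' "pkill -9 -f $cmd || true"
    # 2. Delete the log so stale output is never read as a new result.
    printf '%s\n' "rm -f $log"
    # 3. Clean git + pull: drop uncommitted state, then sync.
    printf '%s\n' "cd $repo_dir && git checkout -- . && git clean -fd && git pull"
    # 4. Run in screen with a timeout. The printed \$? stays escaped so the
    #    inner shell records the test's exit status, not the launcher's.
    printf '%s\n' "screen -dmS $session bash -c \"timeout $timeout_s $cmd >> $log 2>&1; echo EXIT_CODE=\\\$? >> $log\""
}

# 5. One test at a time: one plan, one session name.
harness_plan cuda-test /tmp/cuda-test.log 60 /tmp/test_bin
```

Printing the plan rather than running it keeps the ordering easy to review before anything is killed or deleted.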
### Python test (one command)
```bash
# 1. Edit locally, commit, push
cd ~/dev/nvfp4-megamoe-kernel
git add -A && git commit -m "description" && git push origin master
# 2. One-command test (auto-pushes, runs, dumps log)
# From local machine — auto-pushes, runs, polls, dumps log
~/.openclaw/workspace/fire_b200_test tests/unit/test_fmha_v3_stage_c.py
```
### Manual B200 cycle
### CUDA test (one command)
```bash
# From local machine — compiles with nvcc, runs, polls, dumps log
# Default timeout: 60s. Pass a second arg for custom timeout.
~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_fmha_sm100_standalone.cu
~/.openclaw/workspace/fire_b200_cuda_test tests/unit/test_tmem_minimal.cu 30
```
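The optional second argument could be handled along these lines. This is a minimal sketch under assumptions: `parse_cuda_test_args` is a hypothetical helper, and the real `fire_b200_cuda_test` may validate its arguments differently. Only the 60-second default is taken from the comment above.

```bash
#!/usr/bin/env bash
# Hypothetical argument handling for a "<test.cu> [timeout_seconds]" interface.
parse_cuda_test_args() {
    if [ "$#" -lt 1 ]; then
        echo "usage: fire_b200_cuda_test <test.cu> [timeout_seconds]" >&2
        return 2
    fi
    local test_file="$1"
    local timeout_s="${2:-60}"   # default timeout: 60s
    # Reject anything that is not a whole number of seconds.
    case "$timeout_s" in
        *[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 2
            ;;
    esac
    printf '%s %s\n' "$test_file" "$timeout_s"
}

parse_cuda_test_args tests/unit/test_fmha_sm100_standalone.cu
parse_cuda_test_args tests/unit/test_tmem_minimal.cu 30
```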
### Check on a running CUDA test
```bash
# Show current log + screen status
~/.openclaw/workspace/check_b200_cuda
# Kill a hung test + show the log
~/.openclaw/workspace/check_b200_cuda kill
```
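The two modes above amount to a small dispatch on the first argument. The sketch below prints the remote commands each mode would plausibly run; `check_cuda_plan` and the exact commands are assumptions for illustration, with the session name and log path taken from the harness table.

```bash
#!/usr/bin/env bash
# Hypothetical dispatch for "check_b200_cuda [kill]". Prints commands only.
check_cuda_plan() {
    local action="${1:-status}"
    local session="cuda-test" log="/tmp/cuda-test.log"
    case "$action" in
        status)
            # Show whether the screen session is still alive, then the log.
            printf '%s\n' "screen -list | grep $session || echo 'no $session session'"
            printf '%s\n' "cat $log"
            ;;
        kill)
            # Stop the hung session first, then show what it logged.
            printf '%s\n' "screen -S $session -X quit || true"
            printf '%s\n' "cat $log"
            ;;
        *)
            echo "usage: check_b200_cuda [kill]" >&2
            return 2
            ;;
    esac
}

check_cuda_plan
check_cuda_plan kill
```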
### Manual B200 cycle (emergency only)
```bash
ssh root@<B200>


@@ -1,75 +0,0 @@
#!/bin/bash
# Run the standalone CUDA FMHA test on the B200 using screen sessions.
# Follows the test harness pattern from README.md.
# Usage: bash tests/run_cuda_test.sh
set -e
B200="root@45.76.247.107"
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=4"
PASS='6)Jr)B@dcX[mN?dx'
REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"
VENV="/root/dsv4-nvfp4-workspace/venv/bin/activate"
echo "=== Running CUDA test on B200 via screen harness ==="
sshpass -p "$PASS" ssh $SSH_OPTS $B200 bash -s <<'REMOTE_SCRIPT'
set -e
REPO_DIR="/root/dsv4-nvfp4-workspace/kernel"
# --- CLEANUP (same as run_test.sh) ---
if screen -list 2>/dev/null | grep -q kernel-test; then
session_pid=$(screen -ls | grep kernel-test | grep -o '[0-9]*' | head -1)
if [ -n "$session_pid" ]; then
pkill -9 -P "$session_pid" 2>/dev/null || true
fi
screen -S kernel-test -X quit 2>/dev/null || true
fi
pkill -9 -f test_fmha 2>/dev/null || true
pkill -9 -f test_tmem 2>/dev/null || true
sleep 2
# Delete old log
rm -f /tmp/kernel-test.log
# --- PULL ---
cd $REPO_DIR
git checkout -- . 2>/dev/null || true
git clean -fd 2>/dev/null || true
git pull
# --- COMPILE + RUN in screen ---
export PATH=/usr/local/cuda-13.2/bin:$PATH
# Compile
nvcc -std=c++20 -gencode=arch=compute_100a,code=sm_100a \
-I$REPO_DIR \
$REPO_DIR/tests/unit/test_fmha_sm100_standalone.cu \
-o /tmp/test_fmha_sm100 -lcudart 2>&1 | tee /tmp/kernel-test.log
echo "" >> /tmp/kernel-test.log
echo "=== Running test ===" >> /tmp/kernel-test.log
# Run in screen (survives SSH drops)
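# Escape $? so the inner shell expands it after the test exits; unescaped, the
# launching shell substitutes its own last status before the test even starts.
screen -dmS kernel-test bash -c "timeout 60 /tmp/test_fmha_sm100 >> /tmp/kernel-test.log 2>&1; echo EXIT_CODE=\$? >> /tmp/kernel-test.log"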
sleep 3
if screen -list | grep -q kernel-test; then
echo "OK: screen kernel-test is running"
else
echo "FAIL: screen did not start. Log below:"
cat /tmp/kernel-test.log 2>/dev/null
exit 1
fi
REMOTE_SCRIPT
echo "=== Test launched. Polling for results... ==="
while true; do
RESULT=$(sshpass -p "$PASS" ssh $SSH_OPTS $B200 "screen -list 2>/dev/null | grep -q kernel-test && echo running || echo done" 2>/dev/null || echo "done")
if [ "$RESULT" != "running" ]; then
echo "=== Screen finished. Results: ==="
sshpass -p "$PASS" ssh $SSH_OPTS $B200 "cat /tmp/kernel-test.log"
exit 0
fi
echo " ...still running..."
sleep 10
done