diff --git a/RUNBOOK.md b/RUNBOOK.md
new file mode 100644
index 0000000..2d5b136
--- /dev/null
+++ b/RUNBOOK.md
@@ -0,0 +1,128 @@
+# SmolLM3-3B LoRA Training — Deployment Runbook
+
+## Prerequisites
+
+- [ ] GPU server deployed and accessible via SSH
+- [ ] SSH credentials from Mike (host, user, key/password)
+- [ ] Docker + Docker Compose + NVIDIA Container Toolkit installed on GPU server
+- [ ] This repo at `/home/openclaw/dev/smollora`
+
+## Step 1: SSH In & Prep the Host
+
+```bash
+ssh <user>@<host>
+```
+
+Verify the GPU is visible:
+```bash
+nvidia-smi
+docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
+```
+
+If Docker or the NVIDIA Container Toolkit is missing, install them before continuing.
+
+## Step 2: Create Persistent Directories
+
+```bash
+sudo mkdir -p /srv/smollora/{data,output,hf-cache}
+sudo chown -R $(whoami):$(whoami) /srv/smollora
+```
+
+## Step 3: Copy Project Files to Server
+
+From the OpenClaw box:
+```bash
+scp -r /home/openclaw/dev/smollora <user>@<host>:/tmp/smollora
+```
+
+On the GPU server:
+```bash
+cp -r /tmp/smollora ~/smollora
+cd ~/smollora
+```
+
+## Step 4: Build & Start Container
+
+```bash
+cd ~/smollora
+docker compose build
+docker compose up -d
+```
+
+Verify it's running:
+```bash
+docker compose ps
+docker compose logs --tail=20
+```
+
+## Step 5: Exec In & Kick Off Training
+
+```bash
+docker compose exec smollora bash
+```
+
+Inside the container, run the pipeline:
+```bash
+/app/run.sh
+```
+
+Watch for:
+- ✅ Datasets downloading successfully
+- ✅ Samples counted (should be thousands)
+- ✅ Model loading without OOM
+- ✅ First few training steps completing (check loss is decreasing)
+- ✅ No CUDA OOM errors in the first 50 steps
+
+If data prep already ran and you just want to re-train:
+```bash
+SKIP_PREP=1 /app/run.sh
+```
+
+## Step 6: Monitor
+
+From the host (no need to stay in the container):
+```bash
+# Follow logs
+docker compose logs -f
+
+# GPU utilization
+watch -n5 nvidia-smi
+```
+
+Expected timeline on a single A100:
+- Data prep: ~10-20 min
+- Training (3 epochs, ~20-40k samples): ~2-4 hours
+- Expect longer on smaller GPUs
+
+## Step 7: Verify Output
+
+```bash
+# Check that the LoRA adapter was saved
+ls -la /srv/smollora/output/final/
+# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
+```
+
+## Step 8: Notify Mike
+
+Message Mike:
+> 🎭 LoRA training is running on the GPU box. Data prep done, [N] samples, training started at [time]. Estimated completion: [est]. I'll check back periodically — will ping you if anything blows up or when it finishes.
+
+## Troubleshooting
+
+| Problem | Fix |
+|---------|-----|
+| CUDA OOM | Reduce `BATCH_SIZE` to 2, increase `GRAD_ACCUM` to 8, or reduce `MAX_LENGTH` to 2048 |
+| Dataset download fails | Check internet access; datasets can be pre-downloaded and mounted into `/data` |
+| Docker can't see GPU | Install the toolkit: `sudo apt-get install -y nvidia-container-toolkit && sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker` |
+| Training loss not decreasing | Check LR — try `1e-4` or `5e-5`; verify labels aren't all -100 |
+| Disk full | Clean up `/srv/smollora/hf-cache` after the model loads; processed data is small |
+
+## Rollback
+
+If everything goes sideways (note this wipes any saved adapter in `output/`):
+```bash
+docker compose down
+rm -rf /srv/smollora/output/*
+# Fix whatever broke, then:
+docker compose up -d
+```
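
The Step 7 check can be scripted instead of eyeballed. A minimal sketch (a hypothetical helper, not part of the repo; the default path comes from this runbook, and the expected file list assumes typical PEFT LoRA adapter output):

```shell
# Hypothetical helper, not part of the repo: verify the Step 7 output.
# The default path is /srv/smollora/output/final from this runbook; the
# expected file list is typical for PEFT adapters and may vary by version.
check_adapter() {
  local out="${1:-/srv/smollora/output/final}"
  local missing=0 f
  for f in adapter_config.json adapter_model.safetensors; do
    if [ -s "$out/$f" ]; then
      echo "OK: $out/$f"
    else
      echo "MISSING or empty: $out/$f"
      missing=1
    fi
  done
  return "$missing"
}
```

A non-zero exit code means a file is missing or empty, which makes this easy to wire into the Step 8 notification.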
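
On the CUDA OOM row of the troubleshooting table: when you cut `BATCH_SIZE`, raise `GRAD_ACCUM` so the effective batch size (batch × accumulation) stays roughly constant, since the learning rate is tuned against it. A sketch of the arithmetic (hypothetical helper; the table's 2/8 pairing implies an effective batch of 16, which is an assumption, not something the repo states):

```shell
# Hypothetical helper: given a target effective batch size and a new
# per-device BATCH_SIZE, compute the GRAD_ACCUM that preserves it.
# Effective batch = BATCH_SIZE * GRAD_ACCUM; rounds up on uneven splits.
grad_accum_for() {
  local effective=$1 new_batch=$2
  echo $(( (effective + new_batch - 1) / new_batch ))
}
```

For example, `grad_accum_for 16 2` prints 8, matching the table's suggested `BATCH_SIZE=2`, `GRAD_ACCUM=8`.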