Add deployment runbook

2026-04-10 05:28:30 +00:00
parent 6af62c85d5
commit 46a3ddbb25
1 changed files with 128 additions and 0 deletions
--- a/RUNBOOK.md
+++ b/RUNBOOK.md
@@ -0,0 +1,128 @@
+# SmolLM3-3B LoRA Training — Deployment Runbook
+
+## Prerequisites
+
+- [ ] GPU server deployed and accessible via SSH
+- [ ] SSH creds from Mike (host, user, key/password)
+- [ ] Docker + Docker Compose + NVIDIA Container Toolkit installed on GPU server
+- [ ] This repo at `/home/openclaw/dev/smollora`
+
+## Step 1: SSH In & Prep the Host
+
+```bash
+ssh <user>@<host>
+```
+
+Verify the GPU is visible on the host and from inside a container:
+```bash
+nvidia-smi
+docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
+```
+
+If Docker or the NVIDIA Container Toolkit is missing, install it before continuing.
+
+## Step 2: Create Persistent Directories
+
+```bash
+sudo mkdir -p /srv/smollora/{data,output,hf-cache}
+sudo chown -R "$(id -un):$(id -gn)" /srv/smollora
+```
+
+## Step 3: Copy Project Files to Server
+
+From the OpenClaw box:
+```bash
+scp -r /home/openclaw/dev/smollora <user>@<host>:/tmp/smollora
+```
+
+On the GPU server:
+```bash
+cp -r /tmp/smollora ~/smollora
+cd ~/smollora
+```
+
+## Step 4: Build & Start Container
+
+```bash
+cd ~/smollora
+docker compose build
+docker compose up -d
+```
+
+Verify it's running:
+```bash
+docker compose ps
+docker compose logs --tail=20
+```
+
+## Step 5: Exec In & Kick Off Training
+
+```bash
+docker compose exec smollora bash
+```
+
+Inside the container, run the pipeline:
+```bash
+/app/run.sh
+```
+
+Watch for:
+- ✅ Datasets downloading successfully
+- ✅ Samples counted (expect tens of thousands; Step 6 assumes ~20-40k)
+- ✅ Model loading without OOM
+- ✅ First few training steps completing (check that the loss is decreasing)
+- ✅ No CUDA OOM errors in first 50 steps
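+
+A rough way to check the last item is to scan captured training output for PyTorch's out-of-memory message. This is a sketch, not part of the repo: `scan_for_oom` is a hypothetical helper, and it assumes the training output reaches whatever stream or file you pipe into it.
+
+```bash
+# Hypothetical helper: read log text on stdin and flag CUDA OOM errors.
+scan_for_oom() {
+  # PyTorch reports these as "CUDA out of memory"; match case-insensitively
+  # and print each matching line with its line number.
+  if grep -i -n "out of memory"; then
+    echo "OOM detected: see the CUDA OOM row under Troubleshooting" >&2
+    return 1
+  fi
+  echo "no OOM messages found"
+}
+```
+
+For example, from the host: `docker compose logs --no-color | scan_for_oom`.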
+
+If data prep already ran and you just want to re-train:
+```bash
+SKIP_PREP=1 /app/run.sh
+```
+
+## Step 6: Monitor
+
+From the host (no need to stay in the container):
+```bash
+# Follow logs
+docker compose logs -f
+
+# GPU utilization
+watch -n5 nvidia-smi
+```
+
+Expected timeline on a single A100:
+- Data prep: ~10-20 min
+- Training (3 epochs, ~20-40k samples): ~2-4 hours
+- Expect longer runs on GPUs with less memory or compute
+
+## Step 7: Verify Output
+
+```bash
+# Check the LoRA adapter was saved
+ls -la /srv/smollora/output/final/
+# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
+```
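+
+The manual check above can be scripted. A minimal sketch, assuming only the two adapter file names listed above; `check_adapter_dir` is a hypothetical helper, and tokenizer files are left out because the runbook does not name them.
+
+```bash
+# Hypothetical helper: report expected adapter files that are missing or empty.
+check_adapter_dir() {
+  dir="$1"
+  missing=0
+  for f in adapter_config.json adapter_model.safetensors; do
+    # -s is true only if the file exists and is non-empty.
+    if [ ! -s "$dir/$f" ]; then
+      echo "missing or empty: $dir/$f"
+      missing=1
+    fi
+  done
+  return "$missing"
+}
+
+if check_adapter_dir /srv/smollora/output/final; then
+  echo "adapter files present"
+fi
+```
+
+A non-zero exit status means at least one file is absent, so the function can gate later steps such as the notification in Step 8.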
+
+## Step 8: Notify Mike
+
+Message Mike:
+> 🎭 LoRA training is running on the GPU box. Data prep done, [N] samples, training started at [time]. Estimated completion: [est]. I'll check back periodically — will ping you if anything blows up or when it finishes.
+
+## Troubleshooting
+
+| Problem | Fix |
+|---------|-----|
+| CUDA OOM | Reduce `BATCH_SIZE` to 2, increase `GRAD_ACCUM` to 8, or reduce `MAX_LENGTH` to 2048 |
+| Dataset download fails | Check the server's internet connection; alternatively, pre-download the datasets and mount them into `/data` |
+| Docker can't see GPU | Install nvidia-container-toolkit and register it with Docker: `sudo apt-get install -y nvidia-container-toolkit && sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker` |
+| Training loss not decreasing | Check LR — try `1e-4` or `5e-5`; verify labels aren't all -100 |
+| Disk full | Clean up `/srv/smollora/hf-cache` after model loads; processed data is small |
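+
+For the disk-full case, it helps to see how much space is left before deleting anything. A small sketch using standard `df`; `free_gib` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: print whole GiB available on the filesystem holding a path.
+free_gib() {
+  # -P forces POSIX single-line output; -k reports 1024-byte blocks.
+  # Column 4 of the data row is the available space.
+  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
+}
+```
+
+On the GPU server, `free_gib /srv/smollora` shows the remaining space and `du -sh /srv/smollora/*` shows which directory is using it.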
+
+## Rollback
+
+If everything goes sideways:
+```bash
+docker compose down
+# Deletes everything under output/, including any saved adapter
+rm -rf /srv/smollora/output/*
+# Fix whatever broke, then:
+docker compose up -d
+```