RUNBOOK.md

# SmolLM3-3B LoRA Training — Deployment Runbook

## Prerequisites

- [ ] GPU server deployed and accessible via SSH
- [ ] SSH creds from Mike (host, user, key/password)
- [ ] Docker + Docker Compose + NVIDIA Container Toolkit installed on GPU server
- [ ] This repo at `/home/openclaw/dev/smollora`

## Step 1: SSH In & Prep the Host

```bash
ssh <user>@<host>
```

Verify the GPU is visible on the host and from inside a container:
```bash
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

If Docker or the NVIDIA Container Toolkit is missing, install it before continuing.
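
The check above can be scripted as a quick preflight. This is a minimal sketch; `require_cmds` is a hypothetical helper name, not something shipped in this repo, and it only checks that the binaries are on `PATH` (the `docker run --gpus all` command above is still the real test of the toolkit).

```bash
# Hypothetical helper: report every required command that is not on PATH.
# Returns 0 if all are present, 1 if any are missing.
require_cmds() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the GPU server:
#   require_cmds docker nvidia-smi && docker compose version
```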

## Step 2: Create Persistent Directories

```bash
sudo mkdir -p /srv/smollora/{data,output,hf-cache}
# Use the current user's primary group, which is not always named after the user
sudo chown -R "$(id -un):$(id -gn)" /srv/smollora
```
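
Before moving on, it may be worth confirming that the three directories exist and are writable by the current user. A small sketch, assuming the directory names from the `mkdir` command above; `check_dirs` is an illustrative name only.

```bash
# Hypothetical helper: check that data, output and hf-cache exist under the
# given base directory and are writable by the current user.
check_dirs() {
  local base=$1 sub rc=0
  for sub in data output hf-cache; do
    if [ -d "$base/$sub" ] && [ -w "$base/$sub" ]; then
      echo "ok: $base/$sub"
    else
      echo "missing or not writable: $base/$sub" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_dirs /srv/smollora
```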

## Step 3: Copy Project Files to Server

From the OpenClaw box:
```bash
scp -r /home/openclaw/dev/smollora <user>@<host>:/tmp/smollora
```

On the GPU server:
```bash
# -T (GNU cp) copies into ~/smollora itself, even if that directory already exists
cp -rT /tmp/smollora ~/smollora
cd ~/smollora
```

## Step 4: Build & Start Container

```bash
cd ~/smollora
docker compose build
docker compose up -d
```

Verify it's running:
```bash
docker compose ps
docker compose logs --tail=20
```
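
For a scripted yes/no answer instead of reading the `ps` table, something like the following could work. It is a sketch that assumes Docker Compose v2 (for the `--status` filter) and that the service is named `smollora`, as in the `exec` command in Step 5; `service_listed` is a hypothetical helper.

```bash
# Hypothetical helper: succeed if the given service name appears as an exact
# line on stdin (so "smollora-db" does not match "smollora").
service_listed() {
  grep -Fxq -- "$1"
}

# Usage (from ~/smollora):
#   docker compose ps --services --status running | service_listed smollora \
#     && echo "smollora is running" || echo "smollora is NOT running"
```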

## Step 5: Exec In & Kick Off Training

```bash
docker compose exec smollora bash
```

Inside the container, run the pipeline:
```bash
/app/run.sh
```

Watch for:
- ✅ Datasets downloading successfully
- ✅ Samples counted (expect tens of thousands; the Step 6 timeline assumes ~20-40k)
- ✅ Model loading without OOM
- ✅ First few training steps completing (check that the loss is decreasing)
- ✅ No CUDA OOM errors in first 50 steps
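
The OOM check in the list above can be automated by scanning the logs for PyTorch's `CUDA out of memory` message. A minimal sketch; `find_cuda_oom` is an illustrative name, and the loss check is left manual because the log format of the training script is not known here.

```bash
# Hypothetical helper: read log text on stdin and print any lines (with line
# numbers) that contain PyTorch's CUDA OOM message. Exits 0 only on a match.
find_cuda_oom() {
  grep -n -i -F 'CUDA out of memory'
}

# Usage from the host:
#   docker compose logs --no-color smollora | find_cuda_oom && echo "OOM detected"
```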

If data prep already ran and you just want to re-train:
```bash
SKIP_PREP=1 /app/run.sh
```

## Step 6: Monitor

From the host (no need to stay in the container):
```bash
# Follow logs
docker compose logs -f

# GPU utilization
watch -n5 nvidia-smi
```

Expected timeline on a single A100:
- Data prep: ~10-20 min
- Training (3 epochs, ~20-40k samples): ~2-4 hours
- Expect longer runs on smaller GPUs
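
If the full `nvidia-smi` table is too noisy for a long run, its CSV query mode can be reduced to a one-line memory summary per GPU. This is a sketch; `gpu_mem_pct` is a hypothetical helper, and GPUs are numbered by output line order.

```bash
# Hypothetical helper: turn "used, total" CSV lines (MiB, no header, no units)
# into a percentage summary, one line per GPU.
gpu_mem_pct() {
  awk -F', *' 'NF >= 2 && $2 > 0 {
    printf "GPU %d: %.0f%% memory used\n", NR - 1, 100 * $1 / $2
  }'
}

# Usage:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_mem_pct
```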

## Step 7: Verify Output

```bash
# Check the LoRA adapter was saved
ls -la /srv/smollora/output/final/
# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
```
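
The same check can be made pass/fail so it is usable from a script. A minimal sketch that only tests the two adapter files named above; `verify_adapter` is an illustrative name, and tokenizer files are not checked because their exact names are not listed here.

```bash
# Hypothetical helper: confirm the adapter files exist and are non-empty.
verify_adapter() {
  local dir=$1 f rc=0
  for f in adapter_config.json adapter_model.safetensors; do
    if [ -s "$dir/$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   verify_adapter /srv/smollora/output/final
```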

## Step 8: Notify Mike

Message Mike:
> 🎭 LoRA training is running on the GPU box. Data prep done, [N] samples, training started at [time]. Estimated completion: [est]. I'll check back periodically — will ping you if anything blows up or when it finishes.

## Troubleshooting

| Problem | Fix |
|---------|-----|
| CUDA OOM | Reduce `BATCH_SIZE` to 2, increase `GRAD_ACCUM` to 8, or reduce `MAX_LENGTH` to 2048 |
| Dataset download fails | Check the server's internet access; alternatively, pre-download the datasets and mount them into `/data` |
| Docker can't see GPU | Install nvidia-container-toolkit: `sudo apt-get install -y nvidia-container-toolkit && sudo systemctl restart docker` |
| Training loss not decreasing | Check the learning rate (try `1e-4` or `5e-5`) and verify the labels aren't all `-100` |
| Disk full | Clean up `/srv/smollora/hf-cache` after the model loads; the processed data is small |
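
On the CUDA OOM row: with gradient accumulation, the effective batch size per device is `BATCH_SIZE × GRAD_ACCUM`, so lowering one while raising the other reduces memory use without changing the optimization batch. A small sketch of the arithmetic; `effective_batch` is a hypothetical helper, and passing these variables through the environment is an assumption based on how `SKIP_PREP` is used in Step 5.

```bash
# Hypothetical helper: effective per-device batch size under gradient accumulation.
effective_batch() {
  echo $(( $1 * $2 ))
}

effective_batch 2 8   # prints 16

# ASSUMPTION: run.sh reads these from the environment, like SKIP_PREP.
#   BATCH_SIZE=2 GRAD_ACCUM=8 /app/run.sh
```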

## Rollback

If everything goes sideways:
```bash
docker compose down
# WARNING: deletes all training output, including any saved adapter
rm -rf /srv/smollora/output/*
# Fix whatever broke, then:
docker compose up -d
```
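
If the failed run's output might be useful for debugging, a gentler variant is to move it aside instead of deleting it. This is a sketch, not part of the original procedure; `archive_output` is a hypothetical helper, and it should only be run after `docker compose down` so the directory is no longer mounted in a running container.

```bash
# Hypothetical helper: rename the output directory with a timestamp suffix,
# recreate it empty, and print where the old contents went.
archive_output() {
  local out=${1%/} dest
  dest="$out.bak-$(date +%Y%m%d-%H%M%S)"
  mv -- "$out" "$dest" && mkdir -p -- "$out" && echo "$dest"
}

# Usage:
#   docker compose down
#   archive_output /srv/smollora/output
```
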

Add deployment runbook 2026-04-10 05:28:30 +00:00
