# SmolLM3-3B LoRA Training — Deployment Runbook

## Prerequisites

- [ ] GPU server deployed and accessible via SSH
- [ ] SSH creds from Mike (host, user, key/password)
- [ ] Docker + Docker Compose + NVIDIA Container Toolkit installed on GPU server
- [ ] This repo at `/home/openclaw/dev/smollora`
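
The tool-related items in this checklist can be spot-checked once you are logged in to the GPU server. The following is a minimal sketch, not part of the repo: `missing_tools` is a hypothetical helper name, and it only confirms that a command is on `PATH`, not that it is configured correctly.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which of the given commands are not on PATH.
# Prints one missing command per line; returns non-zero if any are missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}
```

Possible usage on the GPU server: `missing_tools docker nvidia-smi && docker compose version`. The GPU check in Step 1 remains the real test that the NVIDIA Container Toolkit works.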

## Step 1: SSH In & Prep the Host

```bash
ssh <user>@<host>
```

Verify that the GPU is visible on the host and from inside a container:
```bash
# On the host
nvidia-smi
# From inside a container (exercises the NVIDIA Container Toolkit)
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

If Docker or the NVIDIA Container Toolkit is missing, install it before continuing.

## Step 2: Create Persistent Directories

```bash
sudo mkdir -p /srv/smollora/{data,output,hf-cache}
# Use the primary group rather than assuming a group named after the user
sudo chown -R "$(id -un):$(id -gn)" /srv/smollora
```
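
Before moving on, it may be worth confirming that the three directories exist and are writable by the current user. A minimal sketch, assuming the layout created above; `check_dirs` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: check that each expected subdirectory exists under the
# base directory and is writable by the current user.
check_dirs() {
  local base="$1" sub status=0
  for sub in data output hf-cache; do
    if [ -d "$base/$sub" ] && [ -w "$base/$sub" ]; then
      echo "ok       $base/$sub"
    else
      echo "PROBLEM  $base/$sub"
      status=1
    fi
  done
  return "$status"
}
```

Possible usage: `check_dirs /srv/smollora`. A `PROBLEM` line usually means the `mkdir` or `chown` above did not take effect.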

## Step 3: Copy Project Files to Server

From the OpenClaw box:
```bash
scp -r /home/openclaw/dev/smollora <user>@<host>:/tmp/smollora
```

On the GPU server:
```bash
# If ~/smollora already exists, cp -r nests the copy inside it; move the old one aside first
cp -r /tmp/smollora ~/smollora
cd ~/smollora
```
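
To confirm the copy arrived intact, one option is to compare a digest of the tree on both machines. This is a sketch under assumptions: `tree_digest` is a hypothetical helper, and it relies on `sha256sum` and a `sort` that supports `-z` (GNU coreutils), which may not be present on every box.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a single SHA-256 digest summarizing every file
# (relative path and content) under the given directory.
tree_digest() {
  (
    cd "$1" &&
      find . -type f -print0 |
      LC_ALL=C sort -z |
      xargs -0 sha256sum |
      sha256sum |
      cut -d' ' -f1
  )
}
```

Possible usage: run `tree_digest /home/openclaw/dev/smollora` on the OpenClaw box and `tree_digest ~/smollora` on the GPU server; the two values should match.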

## Step 4: Build & Start Container

```bash
cd ~/smollora
docker compose build
docker compose up -d
```

Verify it's running:
```bash
docker compose ps
docker compose logs --tail=20
```
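
If you would rather not eyeball `docker compose ps`, the check can be scripted as a short polling loop. A minimal sketch: `wait_for_service` is a hypothetical helper, and the service name `smollora` is taken from the `exec` command in Step 5.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll until the named Compose service is listed as
# running, or give up after a number of attempts.
# Usage: wait_for_service <service> [attempts] [delay-seconds]
wait_for_service() {
  local service="$1" attempts="${2:-30}" delay="${3:-2}" i
  for i in $(seq 1 "$attempts"); do
    if docker compose ps --status running --services 2>/dev/null |
      grep -qx "$service"; then
      echo "$service is running"
      return 0
    fi
    sleep "$delay"
  done
  echo "$service did not reach the running state" >&2
  return 1
}
```

Possible usage from `~/smollora`: `wait_for_service smollora`. If it times out, `docker compose logs --tail=20` is the next place to look.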

## Step 5: Exec In & Kick Off Training

```bash
docker compose exec smollora bash
```

Inside the container, run the pipeline:
```bash
/app/run.sh
```

Watch for:
- ✅ Datasets downloading successfully
- ✅ Samples counted (expect roughly 20-40k, per the timeline in Step 6)
- ✅ Model loading without OOM
- ✅ First few training steps completing (check that the loss is decreasing)
- ✅ No CUDA OOM errors in the first 50 steps
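
The two OOM items in this list can be checked mechanically if the run output is captured to a file. The sketch below assumes a capture such as `/app/run.sh 2>&1 | tee run.log`; the file name `run.log` and the helper `check_log_for_oom` are hypothetical. It covers only out-of-memory messages, since the format of the sample-count and loss lines is not shown in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print any out-of-memory lines (with line numbers)
# found in a log file. Returns 0 when the log is clean, 1 otherwise.
check_log_for_oom() {
  local logfile="$1"
  if grep -i -n "out of memory" "$logfile"; then
    return 1
  fi
  echo "no out-of-memory messages in $logfile"
}
```

Possible usage: `check_log_for_oom run.log` after the first 50 steps have completed.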

If data prep already ran and you just want to re-train:
```bash
SKIP_PREP=1 /app/run.sh
```
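
The contents of `/app/run.sh` are not shown in this runbook, so the following is only an illustration of the common pattern behind a flag like `SKIP_PREP`, not the actual script. `prepare_data`, `train`, and `run_pipeline` are hypothetical stand-ins.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the real pipeline stages.
prepare_data() { echo "prep"; }
train() { echo "train"; }

# Illustration only: run data prep unless SKIP_PREP is exactly 1.
run_pipeline() {
  if [ "${SKIP_PREP:-0}" != "1" ]; then
    prepare_data
  fi
  train
}
```

With this pattern, `run_pipeline` runs both stages and `SKIP_PREP=1 run_pipeline` runs training only. Check the real script before relying on any other value of the variable.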

## Step 6: Monitor

From the host (no need to stay in the container):
```bash
# Follow logs
docker compose logs -f

# GPU utilization
watch -n5 nvidia-smi
```
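
`watch` shows only the current state. If a record of utilization and memory over the whole run would help (for example, to see how close the run gets to the memory limit), `nvidia-smi` can write periodic samples to a CSV file. A minimal sketch; `log_gpu_usage` and the output path are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append GPU utilization and memory samples to a CSV
# file every N seconds (default 5). Runs until interrupted with Ctrl-C.
log_gpu_usage() {
  local outfile="$1" interval="${2:-5}"
  nvidia-smi \
    --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
    --format=csv -l "$interval" >> "$outfile"
}
```

Possible usage on the host: `log_gpu_usage ~/gpu-usage.csv 5 &` to sample in the background while training runs.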

Expected timeline on a single A100:
- Data prep: ~10-20 min
- Training (3 epochs, ~20-40k samples): ~2-4 hours
- Expect longer runs on smaller GPUs
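
Once the first steps have completed and you can see how long a step takes, the estimate can be refined with simple arithmetic. The sketch below is an approximation under stated assumptions: `estimate_run` is a hypothetical helper, the batch size and gradient accumulation values are the ones from the Troubleshooting table, and the seconds-per-step figure is an illustrative whole number, not a measurement. The real step count depends on how the trainer batches and truncates data.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate optimizer steps and wall-clock time.
# Usage: estimate_run <samples> <epochs> <batch_size> <grad_accum> <sec_per_step>
estimate_run() {
  local samples="$1" epochs="$2" batch="$3" accum="$4" sec_per_step="$5"
  local effective=$(( batch * accum ))
  # Ceiling division: a partial final batch still costs one step per epoch.
  local steps_per_epoch=$(( (samples + effective - 1) / effective ))
  local steps=$(( steps_per_epoch * epochs ))
  local total=$(( steps * sec_per_step ))
  echo "steps=$steps eta=$(( total / 3600 ))h$(( (total % 3600) / 60 ))m"
}
```

For example, `estimate_run 30000 3 2 8 2` prints `steps=5625 eta=3h7m`, which falls inside the 2-4 hour range above.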

## Step 7: Verify Output

```bash
# Check the LoRA adapter was saved
ls -la /srv/smollora/output/final/
# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
```
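
The same check can be scripted so that a missing or empty adapter file produces a non-zero exit status. A minimal sketch: `check_adapter` is a hypothetical helper, and it tests only the two adapter files named above because the exact tokenizer file names are not listed in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: confirm the adapter files exist and are non-empty.
check_adapter() {
  local dir="$1" f status=0
  for f in adapter_config.json adapter_model.safetensors; do
    if [ -s "$dir/$f" ]; then
      echo "found    $f"
    else
      echo "MISSING  $f (absent or empty)"
      status=1
    fi
  done
  return "$status"
}
```

Possible usage: `check_adapter /srv/smollora/output/final`.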

## Step 8: Notify Mike

Message Mike:
> 🎭 LoRA training is running on the GPU box. Data prep done, [N] samples, training started at [time]. Estimated completion: [est]. I'll check back periodically — will ping you if anything blows up or when it finishes.

## Troubleshooting

| Problem | Fix |
|---------|-----|
| CUDA OOM | Reduce `BATCH_SIZE` to 2, increase `GRAD_ACCUM` to 8, or reduce `MAX_LENGTH` to 2048 |
| Dataset download fails | Check internet connectivity; alternatively, pre-download the datasets and mount them into `/data` |
| Docker can't see GPU | Install nvidia-container-toolkit: `sudo apt-get install -y nvidia-container-toolkit && sudo systemctl restart docker` |
| Training loss not decreasing | Check the learning rate (try `1e-4` or `5e-5`); verify the labels aren't all `-100` |
| Disk full | Clean up `/srv/smollora/hf-cache` after the model loads; processed data is small |
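
For the disk-full case, it helps to see which directory is actually consuming space before deleting anything. A minimal sketch; `disk_report` is a hypothetical helper that reports sizes in kilobytes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: show free space on the filesystem holding a directory,
# then the size in KB of each immediate child, largest first.
disk_report() {
  local base="$1"
  df -h "$base"
  du -sk "$base"/* 2>/dev/null | sort -rn
}
```

Possible usage: `disk_report /srv/smollora`.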

## Rollback

If everything goes sideways:
```bash
docker compose down
# Deletes all training output, including the adapter in final/
rm -rf /srv/smollora/output/*
# Fix whatever broke, then:
docker compose up -d
```
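
If there is any chance the partial output will be needed for debugging, a gentler alternative to `rm -rf` is to move it aside. This is a sketch of a different approach from the one above, not the runbook's procedure; `archive_output` is a hypothetical helper, and it should be run only after `docker compose down` so the container is not holding the directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: move the current output directory to a timestamped
# sibling, recreate an empty output directory, and print the archive path.
archive_output() {
  local base="${1:?usage: archive_output <base-dir>}"
  local stamp archive
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="$base/output.$stamp"
  mv "$base/output" "$archive" && mkdir -p "$base/output" && echo "$archive"
}
```

Possible usage: `archive_output /srv/smollora`, then delete the archive once it is no longer needed.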