SmolLM3-3B LoRA Training — Deployment Runbook
Prerequisites
- GPU server deployed and accessible via SSH
- SSH creds from Mike (host, user, key/password; optional ~/.ssh/config sketch below)
- Docker + Docker Compose + NVIDIA Container Toolkit installed on GPU server
- This repo at /home/openclaw/dev/smollora
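Optional: to save retyping the connection details, stash them in ~/.ssh/config. A sketch with placeholder values (substitute whatever Mike sends):
# ~/.ssh/config entry; alias, host, user, and key path are all placeholders
Host gpu-box
    HostName <host>
    User <user>
    IdentityFile ~/.ssh/<key>
With that in place, ssh gpu-box works anywhere <user>@<host> appears below.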
Step 1: SSH In & Prep the Host
ssh <user>@<host>
Verify GPU is visible:
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
If Docker or the NVIDIA Container Toolkit is missing, install them before continuing.
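On Ubuntu, the toolkit install typically looks like the sketch below (assumes Docker is already installed and NVIDIA's apt repository has been added per their docs):
# Install the NVIDIA Container Toolkit, wire it into Docker, restart Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker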
Step 2: Create Persistent Directories
sudo mkdir -p /srv/smollora/{data,output,hf-cache}
sudo chown -R $(whoami):$(whoami) /srv/smollora
Step 3: Copy Project Files to Server
From the OpenClaw box:
scp -r /home/openclaw/dev/smollora <user>@<host>:/tmp/smollora
On the GPU server:
cp -r /tmp/smollora ~/smollora
cd ~/smollora
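If you end up re-copying after local edits, rsync beats a full scp. A sketch (the exclude list is an assumption about what's in the repo):
# Incremental re-sync from the OpenClaw box; --delete mirrors removals too
rsync -av --delete --exclude='.git' --exclude='__pycache__' \
  /home/openclaw/dev/smollora/ <user>@<host>:~/smollora/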
Step 4: Build & Start Container
cd ~/smollora
docker compose build
docker compose up -d
Verify it's running:
docker compose ps
docker compose logs --tail=20
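For orientation, the compose file is assumed to look roughly like this sketch: a smollora service (matching the exec target in Step 5), the Step 2 host directories mounted in, and GPU passthrough via a device reservation. The repo's actual docker-compose.yml is authoritative; the container-side paths and HF_HOME setting here are guesses:
# docker-compose.yml; illustrative sketch only, not the repo's real file
services:
  smollora:
    build: .
    volumes:
      - /srv/smollora/data:/data
      - /srv/smollora/output:/output
      - /srv/smollora/hf-cache:/hf-cache
    environment:
      - HF_HOME=/hf-cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]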
Step 5: Exec In & Kick Off Training
docker compose exec smollora bash
Inside the container, run the pipeline:
/app/run.sh
Watch for:
- ✅ Datasets downloading successfully
- ✅ Samples counted (should be thousands)
- ✅ Model loading without OOM
- ✅ First few training steps completing (check loss is decreasing; grep sketch after this list)
- ✅ No CUDA OOM errors in first 50 steps
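To spot-check the loss trend from outside the container (log format depends on the training script; grepping for "loss" is a reasonable first pass for HF Trainer-style output):
# Pull recent loss lines out of the container logs
docker compose logs --tail=500 smollora | grep -i loss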
If data prep already ran and you just want to re-train:
SKIP_PREP=1 /app/run.sh
Step 6: Monitor
From the host (no need to stay in the container):
# Follow logs
docker compose logs -f
# GPU utilization
watch -n5 nvidia-smi
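For a persistent record of GPU load and memory (useful later if you hit OOMs), nvidia-smi can also log to CSV:
# Sample GPU stats every 30s into a CSV; runs until killed
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv -l 30 > /srv/smollora/gpu-usage.csv &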
Expected timeline on a single A100:
- Data prep: ~10-20 min
- Training (3 epochs, ~20-40k samples): ~2-4 hours
- Could be longer on smaller GPUs
Step 7: Verify Output
# Check the LoRA adapter was saved
ls -la /srv/smollora/output/final/
# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
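A scripted version of the same check; the file list mirrors the expectation above (some PEFT versions write adapter_model.bin instead of the safetensors file, so adjust if needed):
# Fail loudly if an expected adapter artifact is missing or empty
for f in adapter_config.json adapter_model.safetensors; do
  [ -s "/srv/smollora/output/final/$f" ] || { echo "MISSING: $f"; exit 1; }
done
du -h /srv/smollora/output/final/*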
Step 8: Notify Mike
Message Mike:
🎭 LoRA training is running on the GPU box. Data prep done, [N] samples, training started at [time]. Estimated completion: [est]. I'll check back periodically — will ping you if anything blows up or when it finishes.
Troubleshooting
| Problem | Fix |
|---|---|
| CUDA OOM | Reduce BATCH_SIZE to 2, increase GRAD_ACCUM to 8, or reduce MAX_LENGTH to 2048 (inline example below the table) |
| Dataset download fails | Check internet; can pre-download and mount into /data |
| Docker can't see GPU | Install the toolkit and wire it into Docker: sudo apt-get install -y nvidia-container-toolkit && sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker |
| Training loss not decreasing | Check LR — try 1e-4 or 5e-5; verify labels aren't all -100 |
| Disk full | Clean up /srv/smollora/hf-cache after model loads; processed data is small |
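For the CUDA OOM row: SKIP_PREP=1 in Step 5 suggests run.sh reads its knobs from the environment. If so (an assumption; check the script before relying on it), the overrides can be passed inline:
# Hypothetical inline overrides for an OOM retry; verify run.sh reads these vars
SKIP_PREP=1 BATCH_SIZE=2 GRAD_ACCUM=8 MAX_LENGTH=2048 /app/run.sh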
Rollback
If everything goes sideways:
docker compose down
rm -rf /srv/smollora/output/*
# Fix whatever broke, then:
docker compose up -d