biondizzle/smollora

Jinx 46a3ddbb25 Add deployment runbook

2026-04-10 05:28:30 +00:00

3.0 KiB

SmolLM3-3B LoRA Training — Deployment Runbook

Prerequisites

GPU server deployed and reachable over SSH
SSH credentials from Mike (host, user, key or password)
Docker, Docker Compose, and the NVIDIA Container Toolkit installed on the GPU server
This repo checked out at /home/openclaw/dev/smollora on the OpenClaw box

Step 1: SSH In & Prep the Host

ssh <user>@<host>

Verify GPU is visible:

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If Docker or the NVIDIA Container Toolkit is missing, install it before continuing.
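The checks above can be bundled into a small preflight helper. This is a minimal sketch: the function name `missing_tools` is hypothetical, and it only confirms that the commands are on `PATH`, not that the GPU is actually usable from a container.

```shell
# Hypothetical helper: print each command from the argument list that is
# not found on PATH, one per line. Prints nothing when all are present.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example use on the GPU server (not run automatically here):
#   missing="$(missing_tools docker nvidia-smi)"
#   [ -z "$missing" ] || { echo "Install first: $missing" >&2; exit 1; }
```

An empty result means both tools were found; the two `nvidia-smi` commands from this step are still needed to confirm the driver and the container runtime work.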

Step 2: Create Persistent Directories

sudo mkdir -p /srv/smollora/{data,output,hf-cache}
sudo chown -R "$(id -un):$(id -gn)" /srv/smollora
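Before moving on, it may be worth confirming that the directories exist and are writable by the current user, since the container will write into them. A minimal sketch, with `dirs_writable` as a hypothetical helper name:

```shell
# Hypothetical helper: return 0 only if every given path is an existing
# directory that the current user can write to; report the first failure.
dirs_writable() {
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
            echo "not a writable directory: $dir" >&2
            return 1
        fi
    done
    return 0
}

# Example use after Step 2 (paths taken from this runbook):
#   dirs_writable /srv/smollora/data /srv/smollora/output /srv/smollora/hf-cache
```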

Step 3: Copy Project Files to Server

From the OpenClaw box:

scp -r /home/openclaw/dev/smollora <user>@<host>:/tmp/smollora

On the GPU server:

cp -r /tmp/smollora ~/smollora
cd ~/smollora
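One pitfall with the copy above: if `~/smollora` already exists from an earlier run, `cp -r` copies into it and produces `~/smollora/smollora`. A minimal sketch of a guard, with `fresh_copy` as a hypothetical helper name:

```shell
# Hypothetical helper: copy SRC to DEST only when DEST does not exist yet,
# so a repeated run cannot silently nest the tree as DEST/<basename of SRC>.
fresh_copy() {
    src="$1"
    dest="$2"
    if [ -e "$dest" ]; then
        echo "refusing to copy: $dest already exists" >&2
        return 1
    fi
    cp -r "$src" "$dest"
}

# Example use on the GPU server:
#   fresh_copy /tmp/smollora "$HOME/smollora" && cd "$HOME/smollora"
```

If the helper refuses, move or remove the old directory deliberately rather than copying over it.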

Step 4: Build & Start Container

cd ~/smollora
docker compose build
docker compose up -d

Verify it's running:

docker compose ps
docker compose logs --tail=20
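The container can take a few seconds to reach the running state, so a single `docker compose ps` may be misleading. A minimal sketch of a polling helper; `retry_until` is a hypothetical name, and the service name `smollora` in the example is taken from the `exec` command in Step 5:

```shell
# Hypothetical helper: run a command up to MAX_TRIES times, sleeping
# DELAY seconds between attempts; return 0 as soon as it succeeds.
retry_until() {
    max_tries="$1"
    delay="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use from ~/smollora (10 tries, 3 seconds apart):
#   retry_until 10 3 sh -c \
#     'docker compose ps --status running --services | grep -qx smollora'
```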

Step 5: Exec In & Kick Off Training

docker compose exec smollora bash

Inside the container, run the pipeline:

/app/run.sh

Watch for:

✅ Datasets downloading successfully
✅ Samples counted (expect roughly 20-40k)
✅ Model loading without OOM
✅ First few training steps completing (check loss is decreasing)
✅ No CUDA OOM errors in first 50 steps
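Two of the checks above (no OOM, loss decreasing) can be run against a saved log instead of by eye. This is a sketch under assumptions: the log line format `{'loss': 1.234, ...}` is assumed, not confirmed by this runbook, and all three function names are hypothetical. The OOM pattern matches the text PyTorch prints when a CUDA allocation fails.

```shell
# Return 0 if the log contains PyTorch's CUDA out-of-memory message.
log_has_oom() {
    grep -q 'CUDA out of memory' "$1"
}

# Print the loss values found in the log, one per line, in order.
# ASSUMPTION: lines look like {'loss': 1.234, ...}; adjust the pattern
# to whatever the training script actually prints.
extract_losses() {
    sed -n "s/.*'loss': \([0-9.][0-9.]*\).*/\1/p" "$1"
}

# Return 0 if at least two losses were logged and the last is below the first.
loss_decreased() {
    extract_losses "$1" | awk '
        NR == 1 { first = $1 + 0 }
        { last = $1 + 0 }
        END { exit !(NR >= 2 && last < first) }'
}

# Example use from the host:
#   docker compose logs --no-color smollora > /tmp/train.log
#   log_has_oom /tmp/train.log && echo "OOM detected"
#   loss_decreased /tmp/train.log || echo "loss is not trending down yet"
```

Comparing only the first and last values is deliberately crude; loss is noisy step to step, so treat a failure early in the run as a prompt to look, not a verdict.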

If data prep already ran and you just want to re-train:

SKIP_PREP=1 /app/run.sh
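The `VAR=1 command` form sets the variable for that one invocation only. The contents of `/app/run.sh` are not shown in this runbook, so the following is only a hypothetical sketch of how a pipeline script could gate its prep stage on such a flag; the function name is an assumption.

```shell
# Hypothetical gate: return 0 (run prep) unless SKIP_PREP is exactly "1".
should_run_prep() {
    [ "${SKIP_PREP:-0}" != "1" ]
}

# Example use inside a pipeline script (stage names are placeholders):
#   if should_run_prep; then
#       echo "running data prep"
#   else
#       echo "SKIP_PREP=1 set, reusing existing processed data"
#   fi
```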

Step 6: Monitor

From the host (no need to stay in the container):

# Follow logs
docker compose logs -f

# GPU utilization
watch -n5 nvidia-smi
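For a log-friendly alternative to `watch`, `nvidia-smi` can emit selected fields as CSV. A minimal sketch that turns one such line into a memory-use percentage; `gpu_mem_pct` is a hypothetical helper name, and it assumes the three fields are queried in the order shown in the example.

```shell
# Hypothetical helper: given one CSV line "util, mem_used, mem_total"
# (the noheader,nounits format below), print memory use as a whole percent.
gpu_mem_pct() {
    printf '%s\n' "$1" | awk -F', *' '{ printf "%d\n", ($2 * 100) / $3 }'
}

# Example use on the host (first GPU only):
#   line="$(nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#           --format=csv,noheader,nounits | head -n 1)"
#   echo "GPU memory in use: $(gpu_mem_pct "$line")%"
```

A value that sits close to 100 during the first steps is an early warning for the CUDA OOM case listed under Troubleshooting.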

Expected timeline on a single A100:

Data prep: ~10-20 min
Training (3 epochs, ~20-40k samples): ~2-4 hours
Expect longer runs on smaller GPUs
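The training estimate can be sanity-checked once a few steps have run: the number of optimizer steps is ceil(samples / (batch size × gradient accumulation)) × epochs, and multiplying by the observed seconds per step gives the wall-clock time. A minimal sketch, assuming a single GPU; the function names are hypothetical and the numbers in the example are illustrative, not this project's defaults.

```shell
# Optimizer steps for a run: ceil(samples / (batch * accum)) * epochs.
total_steps() {
    samples="$1"; epochs="$2"; batch="$3"; accum="$4"
    effective=$((batch * accum))
    echo $(( (samples + effective - 1) / effective * epochs ))
}

# Estimated wall-clock minutes given a measured whole-second step time.
eta_minutes() {
    steps="$1"; secs_per_step="$2"
    echo $(( steps * secs_per_step / 60 ))
}

# Example with illustrative values (30000 samples, 3 epochs, batch 4, accum 4):
#   total_steps 30000 3 4 4     # prints 5625
#   eta_minutes 5625 2          # prints 187 (about 3 hours at 2 s/step)
```

Shell arithmetic is integer-only, so round the measured step time up to stay on the pessimistic side.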

Step 7: Verify Output

# Check the LoRA adapter was saved
ls -la /srv/smollora/output/final/
# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
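The listing above can be turned into a pass/fail check. A minimal sketch, with `adapter_complete` as a hypothetical helper name; it checks only the two adapter files named above, because the exact tokenizer file names are not spelled out here.

```shell
# Hypothetical helper: return 0 only if both adapter files exist in DIR and
# are non-empty; print the first missing or empty one to stderr.
adapter_complete() {
    dir="$1"
    for name in adapter_config.json adapter_model.safetensors; do
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            return 1
        fi
    done
    return 0
}

# Example use on the host:
#   adapter_complete /srv/smollora/output/final && echo "adapter looks complete"
```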

Step 8: Notify Mike

Message Mike:

🎭 LoRA training is running on the GPU box. Data prep done, [N] samples, training started at [time]. Estimated completion: [est]. I'll check back periodically — will ping you if anything blows up or when it finishes.

Troubleshooting

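| Problem | Fix |
| --- | --- |
| CUDA OOM | Reduce `BATCH_SIZE` to 2, increase `GRAD_ACCUM` to 8, or reduce `MAX_LENGTH` to 2048 |
| Dataset download fails | Check the server's internet access; alternatively, pre-download the datasets and mount them into `/data` |
| Docker can't see GPU | Install the NVIDIA Container Toolkit, register it with Docker, and restart the daemon: `sudo apt-get install -y nvidia-container-toolkit && sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker` |
| Training loss not decreasing | Check the learning rate (try `1e-4` or `5e-5`) and verify the labels aren't all -100, the ignore index that masks a token out of the loss |
| Disk full | Clear `/srv/smollora/hf-cache` once the model has loaded; the processed data is small |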
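For the disk-full case, it helps to check usage before the run rather than after it fails. A minimal sketch using POSIX `df -P`; `disk_used_pct` is a hypothetical helper name, and the 90 percent threshold in the example is an arbitrary illustration.

```shell
# Print the used-space percentage (digits only) of the filesystem
# that holds the given path, using POSIX `df -P` output.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example use on the host:
#   pct="$(disk_used_pct /srv/smollora)"
#   [ "$pct" -lt 90 ] || echo "disk is ${pct}% full, clean up hf-cache" >&2
```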

Rollback

If everything goes sideways:

docker compose down
rm -rf /srv/smollora/output/*
# Fix whatever broke, then:
docker compose up -d
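If a partial run might still be worth inspecting, a more cautious variant of the `rm -rf` step is to move the output aside instead of deleting it. A minimal sketch; `archive_output` is a hypothetical helper, and the archive location in the example is an assumption, not a path this project defines.

```shell
# Hypothetical helper: move everything inside OUTPUT_DIR into a new
# timestamped directory under ARCHIVE_ROOT and print that directory.
archive_output() {
    output_dir="$1"
    archive_root="$2"
    target="$archive_root/output-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$target" || return 1
    for entry in "$output_dir"/*; do
        [ -e "$entry" ] || continue   # glob matched nothing
        mv "$entry" "$target/" || return 1
    done
    echo "$target"
}

# Example use after `docker compose down` (archive path is an assumption):
#   archive_output /srv/smollora/output /srv/smollora/archive
```

Like the `output/*` glob in the original command, this skips dotfiles at the top level of the output directory.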
