# SmolLM3-3B LoRA Training: Deployment Runbook

## Prerequisites

- [ ] GPU server deployed and accessible via SSH
- [ ] SSH creds from Mike (host, user, key/password)
- [ ] Docker + Docker Compose + NVIDIA Container Toolkit installed on GPU server
- [ ] This repo at `/home/openclaw/dev/smollora`

## Step 1: SSH In & Prep the Host

```bash
ssh <user>@<host>
```

Verify GPU is visible, both on the host and from inside a container:

```bash
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

If Docker or the NVIDIA Container Toolkit is missing, install it before continuing.

## Step 2: Create Persistent Directories

```bash
sudo mkdir -p /srv/smollora/{data,output,hf-cache}
sudo chown -R $(whoami):$(whoami) /srv/smollora
```

## Step 3: Copy Project Files to Server

From the OpenClaw box:

```bash
scp -r /home/openclaw/dev/smollora <user>@<host>:/tmp/smollora
```

On the GPU server:

```bash
cp -r /tmp/smollora ~/smollora
cd ~/smollora
```

## Step 4: Build & Start Container

```bash
cd ~/smollora
docker compose build
docker compose up -d
```

Verify it's running:

```bash
docker compose ps
docker compose logs --tail=20
```

## Step 5: Exec In & Kick Off Training

```bash
docker compose exec smollora bash
```

Inside the container, run the pipeline:

```bash
/app/run.sh
```

Watch for:

- ✅ Datasets downloading successfully
- ✅ Samples counted (should be thousands)
- ✅ Model loading without OOM
- ✅ First few training steps completing (check loss is decreasing)
- ✅ No CUDA OOM errors in first 50 steps

If data prep already ran and you just want to re-train:

```bash
SKIP_PREP=1 /app/run.sh
```

## Step 6: Monitor

From the host (no need to stay in the container):

```bash
# Follow logs
docker compose logs -f

# GPU utilization
watch -n5 nvidia-smi
```

Expected timeline on a single A100:

- Data prep: ~10-20 min
- Training (3 epochs, ~20-40k samples): ~2-4 hours
- Could be longer on smaller GPUs

## Step 7: Verify Output

```bash
# Check the LoRA adapter was saved
ls -la /srv/smollora/output/final/
# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
```

## Step 8: Notify Mike

Message Mike:

> 🎭 LoRA training is running on the GPU box. Data prep done, [N] samples, training started at [time]. Estimated completion: [est]. I'll check back periodically and will ping you if anything blows up or when it finishes.

## Troubleshooting

| Problem | Fix |
|---------|-----|
| CUDA OOM | Reduce `BATCH_SIZE` to 2, increase `GRAD_ACCUM` to 8, or reduce `MAX_LENGTH` to 2048 |
| Dataset download fails | Check internet access; datasets can be pre-downloaded and mounted into `/data` |
| Docker can't see GPU | Install nvidia-container-toolkit: `sudo apt-get install -y nvidia-container-toolkit && sudo systemctl restart docker` |
| Training loss not decreasing | Check the learning rate (try `1e-4` or `5e-5`); verify labels aren't all `-100` |
| Disk full | Clean up `/srv/smollora/hf-cache` after the model loads; processed data is small |

## Rollback

If everything goes sideways:

```bash
docker compose down
rm -rf /srv/smollora/output/*
# Fix whatever broke, then:
docker compose up -d
```
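
## Optional: Scripted Output Check

The manual `ls` check in Step 7 could be scripted so it fails loudly when the adapter is missing. The following is a minimal sketch, not part of the repo: `check_adapter_output` is a hypothetical helper name, and it only checks the two adapter files named in Step 7. Tokenizer files are not checked because the runbook does not list their names.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not shipped with the project):
# confirms the adapter files from Step 7 exist and are non-empty.
check_adapter_output() {
  # Defaults to the output path used in Step 7.
  local dir="${1:-/srv/smollora/output/final}"
  local missing=0
  local f
  for f in adapter_config.json adapter_model.safetensors; do
    if [ -s "$dir/$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f"
      missing=1
    fi
  done
  # Exit status 0 only if both files are present and non-empty.
  return "$missing"
}
```

Run on the host after training finishes, for example `check_adapter_output && echo "adapter saved"`. A non-zero exit status means at least one adapter file is absent or zero bytes, which is worth investigating before notifying anyone that the run succeeded.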