Runbook: Database Backup and Restore
This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.
Severity: CRITICAL
Time Required: 15-30 minutes
Owner: Platform Team
Overview
Procedures for backing up and restoring the Control Center database and client databases.
Control Center Database
Backup Procedure
1. Manual Backup
# Get database pod name
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')
# Create backup
kubectl exec -n hq $DB_POD -- pg_dump -U postgres control_center > control_center_backup_$(date +%Y%m%d_%H%M%S).sql
# Verify backup file
ls -lh control_center_backup_*.sql
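A non-empty file is not necessarily a complete dump. As a rough additional check, plain-format pg_dump output normally ends with a "PostgreSQL database dump complete" comment, so its absence suggests the dump was cut short. The sketch below assumes a plain-SQL (not custom-format) dump; the function name backup_looks_complete is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file is non-empty and its last
# lines contain the completion marker that plain-format pg_dump writes.
backup_looks_complete() {
  local file="$1"
  [ -s "$file" ] || return 1
  tail -n 10 "$file" | grep -q "PostgreSQL database dump complete"
}

# Example usage (the file name is a placeholder):
#   backup_looks_complete control_center_backup_20260325_020000.sql \
#     || echo "WARNING: dump may be truncated"
```

This only detects truncation; it says nothing about whether the dumped data is the data you expected.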
2. Automated Daily Backup (CronJob)
Create backup CronJob:
# Save as: k8s/base/control-center-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: control-center-db-backup
  namespace: hq
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h control-center-db -U postgres control_center | \
                    gzip > /backup/control_center_$(date +%Y%m%d_%H%M%S).sql.gz
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: control-center-db-secret
                      key: password
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure
Apply:
kubectl apply -f k8s/base/control-center-backup-cronjob.yaml
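Rather than waiting until 2 AM to find out whether the CronJob works, a one-off Job can be created from it with kubectl create job --from. A minimal sketch; the helper names and the "-manual-" naming scheme are assumptions, not part of the original setup.

```shell
#!/bin/bash
CRONJOB_NAME="control-center-db-backup"   # name used in the CronJob manifest
NAMESPACE="hq"

# Hypothetical naming scheme: <cronjob>-manual-<unix timestamp>.
manual_job_name() {
  printf '%s-manual-%s\n' "$CRONJOB_NAME" "${1:-$(date +%s)}"
}

# Create a Job from the CronJob template and wait for it to complete.
run_backup_now() {
  local job
  job=$(manual_job_name)
  kubectl create job "$job" --from="cronjob/$CRONJOB_NAME" -n "$NAMESPACE" &&
    kubectl wait --for=condition=complete "job/$job" -n "$NAMESPACE" --timeout=10m
}

# Example usage:
#   run_backup_now
```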
Restore Procedure
From Local Backup:
# Copy backup to pod
kubectl cp control_center_backup.sql hq/$DB_POD:/tmp/
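# Restore (-f makes psql read the file inside the pod; a local "<" redirect would be resolved on your workstation instead)
kubectl exec -n hq $DB_POD -- psql -U postgres -d control_center -f /tmp/control_center_backup.sql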
Full Restore Process:
# 1. Stop control-center
kubectl scale deployment control-center --replicas=0 -n hq
# 2. Drop and recreate database
kubectl exec -n hq $DB_POD -- psql -U postgres -c "DROP DATABASE IF EXISTS control_center;"
kubectl exec -n hq $DB_POD -- psql -U postgres -c "CREATE DATABASE control_center;"
# 3. Restore from backup
cat control_center_backup.sql | kubectl exec -i -n hq $DB_POD -- psql -U postgres control_center
# 4. Verify tables exist
kubectl exec -n hq $DB_POD -- psql -U postgres control_center -c "\dt"
# 5. Restart control-center
kubectl scale deployment control-center --replicas=1 -n hq
# 6. Verify
curl https://control-center.egintegrations.com/api/engines
# 7. Post status in Slack
# Post in #alerts-info: "Database restored successfully from backup"
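The CronJob above writes gzip-compressed dumps (.sql.gz), while the restore commands expect plain SQL. One way to bridge that is to decompress on the fly while streaming into psql. A sketch, with stream_backup as a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: write a backup to stdout as plain SQL,
# decompressing first when the file name ends in .gz.
stream_backup() {
  local file="$1"
  case "$file" in
    *.gz) gzip -t "$file" && gzip -dc "$file" ;;  # verify the archive, then decompress
    *)    cat "$file" ;;
  esac
}

# Example usage (takes the place of "cat ... |" in step 3; file name is a placeholder):
#   stream_backup control_center_20260325_020000.sql.gz | \
#     kubectl exec -i -n hq "$DB_POD" -- psql -U postgres control_center
```

Running gzip -t first means a corrupt archive fails before any SQL reaches the freshly recreated database.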
Client Databases
Backup Individual Client Database
# Get database pod for client
DB_POD=$(kubectl get pods -n client-{client-slug} -l app=postgres -o jsonpath='{.items[0].metadata.name}')
# Create backup
kubectl exec -n client-{client-slug} $DB_POD -- pg_dump -U postgres {db_name} > {client}_backup_$(date +%Y%m%d).sql
Restore Client Database
# Copy backup to pod
kubectl cp {client}_backup.sql client-{client-slug}/$DB_POD:/tmp/
# Restore (-f makes psql read the file inside the pod)
kubectl exec -n client-{client-slug} $DB_POD -- psql -U postgres -d {db_name} -f /tmp/{client}_backup.sql
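When several clients need backing up, the placeholders above can be filled in by small helper functions. The sketch below assumes every client follows the client-{client-slug} namespace convention and that the slug also serves as the file-name prefix; the helper names and the example slug are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers that expand the runbook's placeholders.
client_namespace()   { printf 'client-%s\n' "$1"; }
client_backup_file() { printf '%s_backup_%s.sql\n' "$1" "${2:-$(date +%Y%m%d)}"; }

# Back up one client database; arguments: <client-slug> <db_name>.
backup_client() {
  local slug="$1" db="$2" ns pod
  ns=$(client_namespace "$slug")
  pod=$(kubectl get pods -n "$ns" -l app=postgres \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  kubectl exec -n "$ns" "$pod" -- pg_dump -U postgres "$db" \
    > "$(client_backup_file "$slug")"
}

# Example usage (slug and database name are placeholders):
#   backup_client acme acme_db
```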
Backup to Cloud Storage
S3/DigitalOcean Spaces
1. Create Backup Script
#!/bin/bash
# Save as: scripts/backup-to-s3.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="control_center_${TIMESTAMP}.sql.gz"
S3_BUCKET="s3://egi-platform-backups"
# Get database pod name (cron does not inherit it from an interactive shell)
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')
# Create backup
kubectl exec -n hq "$DB_POD" -- pg_dump -U postgres control_center | gzip > "/tmp/$BACKUP_FILE"
# Upload to S3
aws s3 cp "/tmp/$BACKUP_FILE" "$S3_BUCKET/control-center/"
# Clean up local file
rm "/tmp/$BACKUP_FILE"
# Keep only the 30 most recent backups (30 days at one backup per day)
aws s3 ls "$S3_BUCKET/control-center/" | \
awk '{print $4}' | \
sort -r | \
tail -n +31 | \
xargs -I {} aws s3 rm "$S3_BUCKET/control-center/{}"
# Post success notification to Slack
# (Use Slack webhook or API)
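The pruning pipeline in the script keeps a fixed number of objects rather than a fixed number of days, and it deletes immediately. Pulling the selection step into its own function makes it possible to dry-run the list before anything is removed. A sketch; keys_to_prune is a hypothetical name:

```shell
#!/bin/bash
# Hypothetical helper: read object names on stdin (one per line) and print
# those that fall outside the newest N. It relies on the timestamp in the
# file name, so reverse lexical order is newest-first.
keys_to_prune() {
  local keep="$1"
  grep -v '^$' | sort -r | tail -n +"$((keep + 1))"
}

# Example dry run against the bucket used by the script:
#   aws s3 ls s3://egi-platform-backups/control-center/ | awk '{print $4}' | keys_to_prune 30
```

Nothing is deleted here; the output is just the list that would be passed on to aws s3 rm.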
2. Make Executable
chmod +x scripts/backup-to-s3.sh
3. Schedule with Cron
# Add to crontab
crontab -e
# Add line:
0 3 * * * /Users/elliottgodwin/Desktop/egi-engine/scripts/backup-to-s3.sh
Disaster Recovery
Complete Platform Restore
Scenario: Complete cluster failure; the entire platform must be restored.
Prerequisites:
- New Kubernetes cluster
- Latest backups available
- Access credentials
Alert Team:
Post in Slack #alerts-critical: "Starting disaster recovery procedure"
Steps:
1. Restore Infrastructure
# Apply base manifests
kubectl apply -f k8s/base/
# Install Argo CD
# (See main setup docs)
2. Restore Database
# Wait for database pod
kubectl wait --for=condition=ready pod -l app=control-center-db -n hq --timeout=5m
# Get pod name
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')
# Restore from latest backup
cat control_center_backup_latest.sql | kubectl exec -i -n hq $DB_POD -- psql -U postgres control_center
3. Deploy Control Center
# Sync with Argo CD
argocd app sync control-center
4. Verify
# Check engines
curl https://control-center.egintegrations.com/api/engines
# Check clients
curl https://control-center.egintegrations.com/api/clients
5. Update Monitoring
- Reconfigure Uptime Kuma monitors
- Verify Slack alerts working
- Check PostHog analytics connectivity
- Validate CrowdSec security monitoring
6. Post Completion
Post in Slack #alerts-critical: "Disaster recovery complete. All systems operational."
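The Restore Database step reads control_center_backup_latest.sql, but the backup procedures in this runbook only produce timestamped files. One way to resolve "latest" is to pick the newest timestamped dump. A sketch, with latest_backup as a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: print the newest control_center_backup_<timestamp>.sql
# in a directory. The %Y%m%d_%H%M%S suffix makes lexical order chronological;
# the [0-9] in the glob skips names such as control_center_backup_latest.sql.
latest_backup() {
  local dir="${1:-.}" f newest=""
  for f in "$dir"/control_center_backup_[0-9]*.sql; do
    [ -e "$f" ] || continue      # the glob matched nothing
    newest="$f"                  # glob results are sorted, so the last one wins
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example usage (the directory is a placeholder):
#   cat "$(latest_backup /path/to/backups)" | \
#     kubectl exec -i -n hq "$DB_POD" -- psql -U postgres control_center
```

The function returns a non-zero status when no dump is found, so a calling script can stop before touching the database.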
Backup Retention Policy
Recommended Schedule:
- Hourly: Keep last 24
- Daily: Keep last 30 days
- Weekly: Keep last 12 weeks
- Monthly: Keep last 12 months
- Yearly: Keep indefinitely
Implement with Script:
#!/bin/bash
# backup-with-retention.sh
BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Create backup
pg_dump ... > "$BACKUP_DIR/hourly/backup_$TIMESTAMP.sql"
# Retention cleanup (-mtime counts whole days, so the 24-hour window uses -mmin;
# -type f keeps find from deleting the tier directories themselves)
find "$BACKUP_DIR/hourly" -type f -mmin +1440 -delete # > 24 hours
find "$BACKUP_DIR/daily" -type f -mtime +30 -delete # > 30 days
find "$BACKUP_DIR/weekly" -type f -mtime +84 -delete # > 12 weeks
find "$BACKUP_DIR/monthly" -type f -mtime +365 -delete # > 12 months
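The script above writes only to hourly/, so the daily, weekly, monthly and yearly tiers need some promotion rule. One possible rule, sketched below, copies the midnight backup into daily/, the Sunday-midnight backup into weekly/, and the midnight backup on the first of the month (or year) into monthly/ (or yearly/). The rule itself and the function name tiers_for are assumptions; the retention policy does not specify how tiers are filled.

```shell
#!/bin/bash
# Hypothetical promotion rule. Arguments come from `date`:
#   hour=%H (00-23), dow=%u (1=Mon..7=Sun), dom=%d (01-31), month=%m (01-12)
tiers_for() {
  local hour="$1" dow="$2" dom="$3" month="$4" tiers="hourly"
  if [ "$hour" = "00" ]; then
    tiers="$tiers daily"
    [ "$dow" = "7" ] && tiers="$tiers weekly"
    if [ "$dom" = "01" ]; then
      tiers="$tiers monthly"
      [ "$month" = "01" ] && tiers="$tiers yearly"
    fi
  fi
  printf '%s\n' "$tiers"
}

# Example usage inside the backup script:
#   for tier in $(tiers_for $(date +'%H %u %d %m')); do
#     [ "$tier" = hourly ] || cp "$BACKUP_DIR/hourly/backup_$TIMESTAMP.sql" "$BACKUP_DIR/$tier/"
#   done
```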
Monitoring Backups
Verify Backup Success
# Check last backup
ls -lht control_center_backup_*.sql | head -1
# Verify backup is not empty
if [ $(wc -l < control_center_backup_latest.sql) -lt 100 ]; then
echo "WARNING: Backup seems too small!"
# Post alert to Slack #alerts-warning
fi
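The size check above does not catch a backup that simply stopped being produced. A freshness check can be added by asking find whether the file is older than a threshold. A sketch; backup_is_stale and the 1440-minute default (24 hours) are assumptions:

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when the file is missing or was last
# modified more than max_minutes ago (default 1440 = 24 hours).
backup_is_stale() {
  local file="$1" max_minutes="${2:-1440}"
  [ -e "$file" ] || return 0
  [ -n "$(find "$file" -mmin +"$max_minutes")" ]
}

# Example usage:
#   backup_is_stale control_center_backup_latest.sql && echo "WARNING: no backup in 24 hours"
```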
Uptime Kuma Backup Monitoring
Create a monitor in Uptime Kuma:
- Monitor Type: "Keyword"
- URL: Backup status endpoint (if available)
- Keyword: "success"
- Alert if keyword not found
- Notification: Slack #alerts-critical
Alert on Backup Failure
Slack Alert Configuration:
Set up alerts for:
- Backup job failed → Slack #alerts-critical
- No backup in 24 hours → Slack #alerts-warning
- Backup size suspiciously small → Slack #alerts-warning
- S3 upload failed → Slack #alerts-critical
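These alerts can be delivered through a Slack incoming webhook, which accepts a JSON body with a text field. A minimal sketch: the SLACK_WEBHOOK_URL variable and both function names are hypothetical, and the target channel is determined by whichever webhook URL is used.

```shell
#!/bin/bash
# Hypothetical helper: build the JSON body, escaping backslashes and
# double quotes. Messages are assumed to be single-line.
slack_payload() {
  local msg="$1"
  msg=${msg//\\/\\\\}
  msg=${msg//\"/\\\"}
  printf '{"text":"%s"}\n' "$msg"
}

# Post a message; SLACK_WEBHOOK_URL must be set by the caller.
slack_notify() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}

# Example usage:
#   slack_notify "Backup job failed"
```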
Testing Restores
Regular Restore Tests
Monthly Test (First Sunday):
1. Create test namespace
kubectl create namespace test-restore
2. Restore to test namespace
# Deploy test database
kubectl apply -f k8s/test/postgres.yaml -n test-restore
# Restore backup
cat control_center_backup_latest.sql | \
kubectl exec -i -n test-restore $TEST_DB_POD -- \
psql -U postgres test_db
3. Verify data integrity
# Check row counts
kubectl exec -n test-restore $TEST_DB_POD -- psql -U postgres test_db -c "
SELECT
'engines' as table_name, COUNT(*) as rows FROM engines
UNION ALL
SELECT 'clients', COUNT(*) FROM clients
UNION ALL
SELECT 'client_invoices', COUNT(*) FROM client_invoices;
"
4. Clean up
kubectl delete namespace test-restore
5. Document test
Post in Slack #alerts-info: "Monthly backup restore test successful"
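The row-count query is easier to act on when a script checks the result. psql's -A and -t flags produce unaligned, tuples-only output with | as the default field separator, which a small filter can inspect. A sketch that fails when any table is empty; check_row_counts is a hypothetical helper, and treating an empty table as a failure is an assumption that may not suit every table.

```shell
#!/bin/bash
# Hypothetical helper: read "table|count" lines on stdin and fail if any
# count is zero or if no lines were read at all.
check_row_counts() {
  awk -F'|' '
    NF == 2 { seen++; if ($2 + 0 == 0) { print "EMPTY: " $1; bad = 1 } }
    END     { exit (bad || seen == 0) ? 1 : 0 }
  '
}

# Example usage:
#   kubectl exec -n test-restore "$TEST_DB_POD" -- psql -U postgres test_db -At -c \
#     "SELECT 'engines', COUNT(*) FROM engines UNION ALL SELECT 'clients', COUNT(*) FROM clients;" \
#     | check_row_counts
```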
Backup Checklist
Before Major Changes:
- Create manual backup
- Verify backup completed successfully
- Test restore in test environment
- Document backup location
- Post in Slack #alerts-info with backup location
- Proceed with changes
After Incident:
- Restore from backup
- Verify data integrity
- Check all services running via Uptime Kuma
- Verify PostHog analytics working
- Check CrowdSec security monitoring
- Document what was restored
- Post incident summary in Slack #alerts-critical
- Update incident report
Emergency Contacts
If backup fails:
- Check disk space: kubectl exec -n hq $DB_POD -- df -h (kubectl top nodes reports only CPU and memory)
- Check pod logs: kubectl logs -n hq $DB_POD
- Check CronJob status: kubectl get cronjobs -n hq
- Manual backup: Follow manual procedure above
- Post in Slack #alerts-critical immediately
If restore fails:
- Check backup file integrity
- Verify database is accessible
- Check for version mismatches
- Try older backup
- Post in Slack #alerts-critical with details
- Contact database administrator
Monitoring Integration
Uptime Kuma
- Monitor: Backup job completion endpoint
- Alert: Failed backup job
- Notification: Slack #alerts-critical
PostHog
- Track: Backup events (success/failure)
- Dashboard: Backup reliability metrics
- Alert: Backup pattern anomalies
CrowdSec
- Monitor: Unauthorized access to backup storage
- Alert: Suspicious backup file access
- Action: Auto-ban suspicious IPs
Last Updated: 2026-03-25
Version: 1.1