Runbook: Database Backup and Restore

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Severity: CRITICAL
Time Required: 15-30 minutes
Owner: Platform Team


Overview

Procedures for backing up and restoring the Control Center database and client databases.


Control Center Database

Backup Procedure

1. Manual Backup

# Get database pod name
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')

# Create backup
kubectl exec -n hq $DB_POD -- pg_dump -U postgres control_center > control_center_backup_$(date +%Y%m%d_%H%M%S).sql

# Verify backup file
ls -lh control_center_backup_*.sql
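
A file listing shows that a dump exists, but not that it finished. Plain-format pg_dump output ends with a "PostgreSQL database dump complete" comment, so a small check for that trailer can catch truncated dumps. The sketch below is illustrative; `verify_dump` is a hypothetical helper name, not part of the original tooling.

```shell
#!/bin/bash
# verify_dump FILE: succeed only if FILE is non-empty and ends with the
# completion comment that pg_dump writes at the end of a plain-format dump.
# (Hypothetical helper; not part of the original runbook.)
verify_dump() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "FAIL: $file is missing or empty" >&2
    return 1
  fi
  if ! tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete'; then
    echo "FAIL: $file has no completion marker (dump may be truncated)" >&2
    return 1
  fi
  echo "OK: $file"
}
```

Run it against the newest dump, for example: verify_dump "$(ls -t control_center_backup_*.sql | head -1)"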

2. Automated Daily Backup (CronJob)

Create backup CronJob:

# Save as: k8s/base/control-center-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: control-center-db-backup
  namespace: hq
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h control-center-db -U postgres control_center | \
                    gzip > /backup/control_center_$(date +%Y%m%d_%H%M%S).sql.gz
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: control-center-db-secret
                      key: password
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure

Apply:

kubectl apply -f k8s/base/control-center-backup-cronjob.yaml
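
Rather than waiting for the 2 AM schedule to find out whether the CronJob works, a one-off Job can be created from its template with `kubectl create job --from=cronjob/...`. The following is a minimal sketch; the function name, the "manual-backup-" job name prefix, and the 10-minute timeout are assumptions.

```shell
#!/bin/bash
# run_backup_now: start a one-off Job from the CronJob template, wait for it
# to complete, then print its logs.
# The job name prefix and the timeout are assumptions, adjust as needed.
run_backup_now() {
  local job
  job="manual-backup-$(date +%Y%m%d-%H%M%S)"
  kubectl create job "$job" --from=cronjob/control-center-db-backup -n hq &&
    kubectl wait --for=condition=complete "job/$job" -n hq --timeout=10m &&
    kubectl logs "job/$job" -n hq
}
```

If the wait times out, kubectl describe on the job usually shows why the pod did not complete.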

Restore Procedure

From Local Backup:

# Copy backup to pod
kubectl cp control_center_backup.sql hq/$DB_POD:/tmp/

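# Restore (-f reads the file inside the pod; a local "<" redirect would look for /tmp/... on your workstation)
kubectl exec -n hq $DB_POD -- psql -U postgres -d control_center -f /tmp/control_center_backup.sql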

Full Restore Process:

# 1. Stop control-center
kubectl scale deployment control-center --replicas=0 -n hq

# 2. Drop and recreate database
kubectl exec -n hq $DB_POD -- psql -U postgres -c "DROP DATABASE IF EXISTS control_center;"
kubectl exec -n hq $DB_POD -- psql -U postgres -c "CREATE DATABASE control_center;"

# 3. Restore from backup
cat control_center_backup.sql | kubectl exec -i -n hq $DB_POD -- psql -U postgres control_center

# 4. Verify tables exist
kubectl exec -n hq $DB_POD -- psql -U postgres control_center -c "\dt"

# 5. Restart control-center
kubectl scale deployment control-center --replicas=1 -n hq

# 6. Verify
curl https://control-center.egintegrations.com/api/engines

# 7. Post status in Slack
# Post in #alerts-info: "Database restored successfully from backup"
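
By default psql keeps going after a failed statement, so a partially applied dump can look like a successful restore. One way to guard against that, sketched below, is to pass `-v ON_ERROR_STOP=1` and `--single-transaction` so the first error aborts and rolls back the whole restore. `restore_dump` is a hypothetical wrapper, and it assumes DB_POD is already set as in the steps above.

```shell
#!/bin/bash
# restore_dump FILE [DB]: stream a local plain-SQL dump into the database pod.
# ON_ERROR_STOP makes psql exit non-zero on the first SQL error, and
# --single-transaction rolls everything back instead of leaving a partial restore.
# Assumes DB_POD is already set (see "Manual Backup").
restore_dump() {
  local file="$1" db="${2:-control_center}"
  [ -s "$file" ] || { echo "FAIL: $file is missing or empty" >&2; return 1; }
  kubectl exec -i -n hq "$DB_POD" -- \
    psql -U postgres -d "$db" -v ON_ERROR_STOP=1 --single-transaction < "$file"
}
```

This would stand in for step 3 of the full restore process: restore_dump control_center_backup.sql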

Client Databases

Backup Individual Client Database

# Get database pod for client
DB_POD=$(kubectl get pods -n client-{client-slug} -l app=postgres -o jsonpath='{.items[0].metadata.name}')

# Create backup
kubectl exec -n client-{client-slug} $DB_POD -- pg_dump -U postgres {db_name} > {client}_backup_$(date +%Y%m%d).sql

Restore Client Database

# Copy backup to pod
kubectl cp {client}_backup.sql client-{client-slug}/$DB_POD:/tmp/

# Restore (-f reads the file inside the pod)
kubectl exec -n client-{client-slug} $DB_POD -- psql -U postgres -d {db_name} -f /tmp/{client}_backup.sql
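
The client commands above need the {client-slug} and {db_name} placeholders substituted by hand each time. A small function that takes them as arguments is one way to reduce copy-paste mistakes. This is a sketch only: the function name and output file naming are illustrative, while the namespace pattern and the app=postgres label are taken from the commands above.

```shell
#!/bin/bash
# backup_client SLUG DB_NAME: dump one client database to a dated local file
# and print the file name. Namespace and label follow the commands above;
# the function name and the output file naming are illustrative.
backup_client() {
  local slug="$1" db="$2"
  local ns="client-${slug}"
  local out pod
  out="${slug}_backup_$(date +%Y%m%d).sql"
  pod=$(kubectl get pods -n "$ns" -l app=postgres \
    -o jsonpath='{.items[0].metadata.name}') || return 1
  [ -n "$pod" ] || { echo "FAIL: no postgres pod found in $ns" >&2; return 1; }
  kubectl exec -n "$ns" "$pod" -- pg_dump -U postgres "$db" > "$out" || return 1
  echo "$out"
}
```

For a hypothetical client with slug "acme" and database "acme_db", the call would be: backup_client acme acme_db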

Backup to Cloud Storage

S3/DigitalOcean Spaces

1. Create Backup Script

#!/bin/bash
# Save as: scripts/backup-to-s3.sh
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="control_center_${TIMESTAMP}.sql.gz"
S3_BUCKET="s3://egi-platform-backups"

# Get database pod name (cron does not inherit DB_POD from an interactive shell)
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')

# Create backup
kubectl exec -n hq "$DB_POD" -- pg_dump -U postgres control_center | gzip > "/tmp/$BACKUP_FILE"

# Upload to S3
aws s3 cp "/tmp/$BACKUP_FILE" "$S3_BUCKET/control-center/"

# Clean up local file
rm "/tmp/$BACKUP_FILE"

# Keep only the 30 newest backups (with one backup per day, roughly the last 30 days)
aws s3 ls "$S3_BUCKET/control-center/" | \
  awk '{print $4}' | \
  sort -r | \
  tail -n +31 | \
  xargs -I {} aws s3 rm "$S3_BUCKET/control-center/{}"

# Post success notification to Slack
# (Use Slack webhook or API)
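
Because the retention step deletes objects, it can help to preview what would be removed before wiring it into cron. The sketch below isolates the selection logic into a function that reads `aws s3 ls` output on stdin and prints only the keys beyond the newest N. It relies on the timestamped file names sorting chronologically; `keys_to_prune` is a hypothetical name.

```shell
#!/bin/bash
# keys_to_prune [KEEP]: read `aws s3 ls` output on stdin (date, time, size, key)
# and print every key except the KEEP newest (default 30). Relies on the
# timestamped file names sorting chronologically. Lines without a key column,
# such as "PRE prefix/" entries, are ignored.
keys_to_prune() {
  local keep="${1:-30}"
  awk 'NF >= 4 {print $4}' | sort -r | tail -n +"$((keep + 1))"
}
```

A dry run would then be: aws s3 ls "$S3_BUCKET/control-center/" | keys_to_prune 30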

2. Make Executable

chmod +x scripts/backup-to-s3.sh

3. Schedule with Cron

# Add to crontab
crontab -e

# Add line:
0 3 * * * /Users/elliottgodwin/Desktop/egi-engine/scripts/backup-to-s3.sh

Disaster Recovery

Complete Platform Restore

Scenario: Complete cluster failure; the entire platform must be restored.

Prerequisites:

  • New Kubernetes cluster
  • Latest backups available
  • Access credentials

Alert Team: Post in Slack #alerts-critical: "Starting disaster recovery procedure"

Steps:

  1. Restore Infrastructure

    # Apply base manifests
    kubectl apply -f k8s/base/

    # Install Argo CD
    # (See main setup docs)
  2. Restore Database

    # Wait for database pod
    kubectl wait --for=condition=ready pod -l app=control-center-db -n hq --timeout=5m

    # Get pod name
    DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')

    # Restore from latest backup
    cat control_center_backup_latest.sql | kubectl exec -i -n hq $DB_POD -- psql -U postgres control_center
  3. Deploy Control Center

    # Sync with Argo CD
    argocd app sync control-center
  4. Verify

    # Check engines
    curl https://control-center.egintegrations.com/api/engines

    # Check clients
    curl https://control-center.egintegrations.com/api/clients
  5. Update Monitoring

    • Reconfigure Uptime Kuma monitors
    • Verify Slack alerts working
    • Check PostHog analytics connectivity
    • Validate CrowdSec security monitoring
  6. Post Completion

    Post in Slack #alerts-critical: "Disaster recovery complete. All systems operational."
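
Step 2 above assumes a local control_center_backup_latest.sql. If the backups live in the S3 bucket described under "Backup to Cloud Storage", the newest one could be picked from the bucket listing instead. The following is a sketch under that assumption; `latest_backup_key` is a hypothetical helper, and the commented usage streams the object with `aws s3 cp <uri> -`.

```shell
#!/bin/bash
# latest_backup_key: read `aws s3 ls` output on stdin and print the newest
# control_center_*.sql.gz key (timestamped names sort chronologically).
latest_backup_key() {
  awk 'NF >= 4 && $4 ~ /^control_center_.*\.sql\.gz$/ {print $4}' | sort | tail -n 1
}

# Example use during recovery (bucket name taken from the cloud storage section):
#   KEY=$(aws s3 ls s3://egi-platform-backups/control-center/ | latest_backup_key)
#   aws s3 cp "s3://egi-platform-backups/control-center/$KEY" - | gunzip | \
#     kubectl exec -i -n hq "$DB_POD" -- psql -U postgres control_center
```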


Backup Retention Policy

Recommended Schedule:

  • Hourly: Keep last 24
  • Daily: Keep last 30 days
  • Weekly: Keep last 12 weeks
  • Monthly: Keep last 12 months
  • Yearly: Keep indefinitely

Implement with Script:

#!/bin/bash
# backup-with-retention.sh

BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create backup
pg_dump ... > $BACKUP_DIR/hourly/backup_$TIMESTAMP.sql

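# Retention cleanup
# Note: -mtime counts whole days, so "-mtime +1" only matches files older than 48 hours;
# -mmin +1440 is used for the 24-hour tier instead.
find $BACKUP_DIR/hourly -type f -mmin +1440 -delete # > 24 hours
find $BACKUP_DIR/daily -type f -mtime +30 -delete # > 30 days
find $BACKUP_DIR/weekly -type f -mtime +84 -delete # > 12 weeks
find $BACKUP_DIR/monthly -type f -mtime +365 -delete # > 12 months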
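
The script above only ever writes to the hourly directory, so the daily, weekly, and monthly tiers need some promotion rule to receive files. One possible rule set is sketched below: the midnight backup is also a daily, Sunday's midnight backup is also a weekly, and so on. These rules and the function name `backup_tiers` are assumptions, not part of the original policy, and the weekday lookup relies on GNU date's -d option.

```shell
#!/bin/bash
# backup_tiers TIMESTAMP: print the retention tiers that a backup taken at
# TIMESTAMP (YYYYMMDD_HHMMSS) belongs to. Promotion rules are assumptions:
#   midnight backup -> daily, Sunday midnight -> weekly,
#   1st of the month -> monthly, January 1st -> yearly.
# Requires GNU date for the -d option.
backup_tiers() {
  local ts="$1"
  local year month day hour
  year=$(printf '%s' "$ts" | cut -c1-4)
  month=$(printf '%s' "$ts" | cut -c5-6)
  day=$(printf '%s' "$ts" | cut -c7-8)
  hour=$(printf '%s' "$ts" | cut -c10-11)
  local tiers="hourly"
  if [ "$hour" = "00" ]; then
    tiers="$tiers daily"
    # %u prints the ISO weekday, where 7 is Sunday
    if [ "$(date -d "$year-$month-$day" +%u)" = "7" ]; then tiers="$tiers weekly"; fi
    if [ "$day" = "01" ]; then tiers="$tiers monthly"; fi
    if [ "$month$day" = "0101" ]; then tiers="$tiers yearly"; fi
  fi
  echo "$tiers"
}
```

A backup script could loop over the printed tiers and copy the new file into each matching directory under $BACKUP_DIR before running the cleanup.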

Monitoring Backups

Verify Backup Success

# Check last backup
ls -lht control_center_backup_*.sql | head -1

# Verify backup is not empty
if [ "$(wc -l < control_center_backup_latest.sql)" -lt 100 ]; then
  echo "WARNING: Backup seems too small!"
  # Post alert to Slack #alerts-warning
fi
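
The size check above does not notice when backups silently stop being produced. A freshness check is one way to cover that gap. The sketch below succeeds only if the directory holds a dump modified within the last N minutes; `backup_is_fresh` is a hypothetical helper, and the file name pattern follows the manual backup command.

```shell
#!/bin/bash
# backup_is_fresh DIR [MAX_AGE_MIN]: succeed if DIR contains at least one
# control_center_backup_*.sql file modified within MAX_AGE_MIN minutes
# (default 1440, i.e. 24 hours). Hypothetical helper.
backup_is_fresh() {
  local dir="$1" max_age="${2:-1440}"
  [ -n "$(find "$dir" -maxdepth 1 -type f -name 'control_center_backup_*.sql' \
    -mmin "-$max_age" 2>/dev/null | head -n 1)" ]
}
```

For example: backup_is_fresh . || echo "WARNING: no backup in the last 24 hours"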

Uptime Kuma Backup Monitoring

Create a monitor in Uptime Kuma:

  1. Monitor Type: "Keyword"
  2. URL: Backup status endpoint (if available)
  3. Keyword: "success"
  4. Alert if keyword not found
  5. Notification: Slack #alerts-critical

Alert on Backup Failure

Slack Alert Configuration:

Set up alerts for:

  • Backup job failed → Slack #alerts-critical
  • No backup in 24 hours → Slack #alerts-warning
  • Backup size suspiciously small → Slack #alerts-warning
  • S3 upload failed → Slack #alerts-critical
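
The alerts listed above could be sent from the backup scripts through a Slack incoming webhook, which accepts a JSON body with a "text" field. The following is a minimal sketch: SLACK_WEBHOOK_URL is an assumed environment variable, both function names are hypothetical, and the target channel is whatever the webhook was created for.

```shell
#!/bin/bash
# slack_payload MESSAGE: print a minimal incoming-webhook JSON body.
# Only backslashes and double quotes are escaped, so MESSAGE should be a
# single line of plain text.
slack_payload() {
  local escaped
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# notify_slack MESSAGE: post MESSAGE to the webhook in SLACK_WEBHOOK_URL
# (an assumed environment variable; one webhook per alert channel).
notify_slack() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}
```

In the S3 script, for instance, a failed upload could be reported with: aws s3 cp ... || notify_slack "S3 upload failed"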

Testing Restores

Regular Restore Tests

Monthly Test (First Sunday):

  1. Create test namespace

    kubectl create namespace test-restore
  2. Restore to test namespace

    # Deploy test database
    kubectl apply -f k8s/test/postgres.yaml -n test-restore

    # Restore backup
    cat control_center_backup_latest.sql | \
    kubectl exec -i -n test-restore $TEST_DB_POD -- \
    psql -U postgres test_db
  3. Verify data integrity

    # Check row counts
    kubectl exec -n test-restore $TEST_DB_POD -- psql -U postgres test_db -c "
    SELECT
    'engines' as table_name, COUNT(*) as rows FROM engines
    UNION ALL
    SELECT 'clients', COUNT(*) FROM clients
    UNION ALL
    SELECT 'client_invoices', COUNT(*) FROM client_invoices;
    "
  4. Clean up

    kubectl delete namespace test-restore
  5. Document test

    Post in Slack #alerts-info: "Monthly backup restore test successful"
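
Row counts from step 3 are more useful when compared against a reference, such as counts recorded when the backup was taken. The sketch below collects counts in a machine-readable form with `psql -At` and compares two listings. Both function names are hypothetical, and counts taken from a live database will only match if nothing was written after the backup.

```shell
#!/bin/bash
# table_counts NAMESPACE POD DB: print "table|rows" lines for the tables
# checked above, using psql -At (unaligned, tuples only) for stable output.
table_counts() {
  local ns="$1" pod="$2" db="$3"
  kubectl exec -n "$ns" "$pod" -- psql -U postgres -d "$db" -At -c "
    SELECT 'engines', COUNT(*) FROM engines
    UNION ALL SELECT 'clients', COUNT(*) FROM clients
    UNION ALL SELECT 'client_invoices', COUNT(*) FROM client_invoices;"
}

# counts_match EXPECTED ACTUAL: succeed if both listings contain the same
# lines regardless of order; otherwise report a mismatch on stderr.
counts_match() {
  local expected actual
  expected=$(printf '%s\n' "$1" | sort)
  actual=$(printf '%s\n' "$2" | sort)
  if [ "$expected" != "$actual" ]; then
    echo "MISMATCH: restored row counts differ from the reference" >&2
    return 1
  fi
}
```

A test run might save the output of table_counts at backup time and pass it as EXPECTED after the restore into test-restore.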


Backup Checklist

Before Major Changes:

  • Create manual backup
  • Verify backup completed successfully
  • Test restore in test environment
  • Document backup location
  • Post in Slack #alerts-info with backup location
  • Proceed with changes

After Incident:

  • Restore from backup
  • Verify data integrity
  • Check all services running via Uptime Kuma
  • Verify PostHog analytics working
  • Check CrowdSec security monitoring
  • Document what was restored
  • Post incident summary in Slack #alerts-critical
  • Update incident report

Emergency Contacts

If backup fails:

  1. Check disk space: kubectl exec -n hq $DB_POD -- df -h (kubectl top nodes reports only CPU and memory)
  2. Check pod logs: kubectl logs -n hq $DB_POD
  3. Check CronJob status: kubectl get cronjobs -n hq
  4. Manual backup: Follow manual procedure above
  5. Post in Slack #alerts-critical immediately

If restore fails:

  1. Check backup file integrity
  2. Verify database is accessible
  3. Check for version mismatches
  4. Try older backup
  5. Post in Slack #alerts-critical with details
  6. Contact database administrator
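
For the version mismatch check, plain-format dumps record the source server version in a "-- Dumped from database version" header comment. The sketch below extracts the major version from a .sql or .sql.gz dump so it can be compared with the target server; `dump_major_version` is a hypothetical helper.

```shell
#!/bin/bash
# dump_major_version FILE: print the major PostgreSQL version recorded in the
# "-- Dumped from database version" header of a plain-format dump.
# Works on .sql and .sql.gz files. Hypothetical helper.
dump_major_version() {
  local file="$1" reader="cat"
  case "$file" in *.gz) reader="gzip -dc" ;; esac
  $reader "$file" | head -n 50 |
    sed -n 's/^-- Dumped from database version \([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

The target server's version can be read with: kubectl exec -n hq $DB_POD -- psql -U postgres -At -c "SHOW server_version;" For compressed backups, gzip -t on the file is a quick integrity check before attempting a restore.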

Monitoring Integration

Uptime Kuma

  • Monitor: Backup job completion endpoint
  • Alert: Failed backup job
  • Notification: Slack #alerts-critical

PostHog

  • Track: Backup events (success/failure)
  • Dashboard: Backup reliability metrics
  • Alert: Backup pattern anomalies

CrowdSec

  • Monitor: Unauthorized access to backup storage
  • Alert: Suspicious backup file access
  • Action: Auto-ban suspicious IPs

Last Updated: 2026-03-25
Version: 1.1