Runbook: Database Backup and Restore

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Severity: CRITICAL
Time Required: 15-30 minutes
Owner: Platform Team

Overview

Procedures for backing up and restoring the Control Center database and the client databases.

Control Center Database

Backup Procedure

1. Manual Backup

# Get database pod name
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')

# Create backup
kubectl exec -n hq $DB_POD -- pg_dump -U postgres control_center > control_center_backup_$(date +%Y%m%d_%H%M%S).sql

# Verify backup file
ls -lh control_center_backup_*.sql
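
Listing the file only shows that it exists. As a minimal sketch, the check below also confirms that the dump finished: plain-text pg_dump output ends with a "PostgreSQL database dump complete" comment, so a file without it was probably cut short. The function name dump_looks_complete is hypothetical and not part of the platform tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a plain-text pg_dump file looks complete.
# Usage: dump_looks_complete <dump-file>
dump_looks_complete() {
  local file="$1"
  # The file must exist and be non-empty.
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  # Plain-format pg_dump output ends with this marker comment.
  if tail -n 10 "$file" | grep -q -- '-- PostgreSQL database dump complete'; then
    return 0
  fi
  echo "no completion marker in: $file" >&2
  return 1
}
```

Example: dump_looks_complete control_center_backup_20260325_020000.sql && echo "backup looks complete" (the file name here is only an illustration).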

2. Automated Daily Backup (CronJob)

Create backup CronJob:

# Save as: k8s/base/control-center-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: control-center-db-backup
  namespace: hq
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:15-alpine
            command:
            - /bin/sh
            - -c
            - |
              pg_dump -h control-center-db -U postgres control_center | \
              gzip > /backup/control_center_$(date +%Y%m%d_%H%M%S).sql.gz
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: control-center-db-secret
                  key: password
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure

Apply:

kubectl apply -f k8s/base/control-center-backup-cronjob.yaml
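
Rather than waiting until 2 AM to learn whether the CronJob works, it can be run once by hand. The sketch below uses kubectl create job --from to copy the CronJob's job template into a one-off Job and then waits for it to complete. The function name run_backup_now, the manual-backup- job name prefix, and the 10-minute timeout are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the backup CronJob once, immediately, as a one-off Job.
run_backup_now() {
  local job_name
  job_name="manual-backup-$(date +%Y%m%d%H%M%S)"
  # "kubectl create job --from=cronjob/..." copies the CronJob's job template.
  kubectl create job "$job_name" --from=cronjob/control-center-db-backup -n hq || return 1
  # Block until the Job reports completion, or give up after 10 minutes.
  kubectl wait --for=condition=complete "job/$job_name" -n hq --timeout=10m
}
```

If the wait times out, kubectl get jobs -n hq and kubectl logs -n hq job/<job-name> are the usual next places to look.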

Restore Procedure

From Local Backup:

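# Copy backup into the pod
kubectl cp control_center_backup.sql hq/$DB_POD:/tmp/control_center_backup.sql

# Restore from the copy inside the pod
# (-f reads the file in the pod; a local "<" redirect would read from your
# workstation and, without -i, would never reach psql)
kubectl exec -n hq $DB_POD -- psql -U postgres -d control_center -f /tmp/control_center_backup.sql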

Full Restore Process:

# 1. Stop control-center
kubectl scale deployment control-center --replicas=0 -n hq

# 2. Drop and recreate database
kubectl exec -n hq $DB_POD -- psql -U postgres -c "DROP DATABASE IF EXISTS control_center;"
kubectl exec -n hq $DB_POD -- psql -U postgres -c "CREATE DATABASE control_center;"

# 3. Restore from backup
cat control_center_backup.sql | kubectl exec -i -n hq $DB_POD -- psql -U postgres control_center

# 4. Verify tables exist
kubectl exec -n hq $DB_POD -- psql -U postgres control_center -c "\dt"

# 5. Restart control-center
kubectl scale deployment control-center --replicas=1 -n hq

# 6. Verify
curl https://control-center.egintegrations.com/api/engines

# 7. Post status in Slack
# Post in #alerts-info: "Database restored successfully from backup"
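
By default psql keeps going after a failed statement, so a restore can look finished while leaving tables half-loaded. As a sketch, step 3 could be wrapped so that the restore stops at the first error and rolls back. The function name restore_control_center is hypothetical; ON_ERROR_STOP and --single-transaction are standard psql options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restore a plain-text dump and stop at the first SQL error.
# Usage: restore_control_center <db-pod-name> <dump-file>
restore_control_center() {
  local pod="$1" dump="$2"
  if [ ! -s "$dump" ]; then
    echo "dump file missing or empty: $dump" >&2
    return 1
  fi
  # ON_ERROR_STOP=1 makes psql exit non-zero on the first failed statement;
  # --single-transaction rolls the whole restore back if any statement fails.
  kubectl exec -i -n hq "$pod" -- \
    psql -U postgres -d control_center -v ON_ERROR_STOP=1 --single-transaction < "$dump"
}
```

Example: restore_control_center "$DB_POD" control_center_backup.sql. A non-zero exit status means nothing was committed, so it is safe to investigate before restarting control-center.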

Client Databases

Backup Individual Client Database

# Replace {client-slug}, {db_name}, and {client} with the real values before running

# Get database pod for client
DB_POD=$(kubectl get pods -n client-{client-slug} -l app=postgres -o jsonpath='{.items[0].metadata.name}')

# Create backup
kubectl exec -n client-{client-slug} $DB_POD -- pg_dump -U postgres {db_name} > {client}_backup_$(date +%Y%m%d).sql

Restore Client Database

# Copy backup into the pod
kubectl cp {client}_backup.sql client-{client-slug}/$DB_POD:/tmp/{client}_backup.sql

# Restore from the copy inside the pod (-f reads the file in the pod)
kubectl exec -n client-{client-slug} $DB_POD -- psql -U postgres -d {db_name} -f /tmp/{client}_backup.sql
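
Filling in the placeholders by hand is easy to get wrong under pressure. The sketch below wraps the client backup commands in a function that takes the slug and database name as arguments. The function name backup_client_db and the slug validation rule (lowercase letters, digits, hyphens) are assumptions; the namespace pattern and the app=postgres label come from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: back up one client database and print the output path.
# Usage: backup_client_db <client-slug> <db-name> [output-dir]
backup_client_db() {
  local LC_ALL=C
  local slug="$1" db="$2" out_dir="${3:-.}"
  # Reject empty slugs and anything other than lowercase letters, digits, hyphens.
  case "$slug" in
    ''|*[!a-z0-9-]*) echo "invalid client slug: '$slug'" >&2; return 1 ;;
  esac
  if [ -z "$db" ]; then
    echo "database name required" >&2
    return 1
  fi
  local ns="client-$slug"
  local out_file
  out_file="$out_dir/${slug}_backup_$(date +%Y%m%d).sql"
  local pod
  pod=$(kubectl get pods -n "$ns" -l app=postgres \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  if [ -z "$pod" ]; then
    echo "no postgres pod found in $ns" >&2
    return 1
  fi
  kubectl exec -n "$ns" "$pod" -- pg_dump -U postgres "$db" > "$out_file" || return 1
  echo "$out_file"
}
```

Example: backup_client_db acme acme_db /backups, where acme and acme_db are made-up values.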

Backup to Cloud Storage

S3/DigitalOcean Spaces

1. Create Backup Script

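#!/bin/bash
# Save as: scripts/backup-to-s3.sh
set -euo pipefail  # Abort on errors, unset variables, and failed pipeline stages

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="control_center_${TIMESTAMP}.sql.gz"
S3_BUCKET="s3://egi-platform-backups"

# Get database pod name (cron does not inherit DB_POD from your shell)
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')

# Create backup
kubectl exec -n hq "$DB_POD" -- pg_dump -U postgres control_center | gzip > "/tmp/$BACKUP_FILE"

# Upload to S3
aws s3 cp "/tmp/$BACKUP_FILE" "$S3_BUCKET/control-center/"

# Clean up local file
rm "/tmp/$BACKUP_FILE"

# Keep only the 30 most recent backups (30 days at one backup per day)
aws s3 ls "$S3_BUCKET/control-center/" | \
  awk '{print $4}' | \
  sort -r | \
  tail -n +31 | \
  xargs -I {} aws s3 rm "$S3_BUCKET/control-center/{}"

# Post success notification to Slack
# (Use Slack webhook or API)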
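
The Slack notification is left open in the script. One possible sketch uses a Slack incoming webhook, which accepts a JSON body of the form {"text": "..."}. The SLACK_WEBHOOK_URL variable and both function names are assumptions, and the escaping only covers backslashes and double quotes, so keep messages to a single line.

```shell
#!/usr/bin/env bash
# Build a minimal Slack incoming-webhook payload: {"text":"..."}.
slack_payload() {
  local escaped
  # Escape backslashes first, then double quotes, so the JSON stays valid.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# Post a single-line message to Slack.
# SLACK_WEBHOOK_URL is a hypothetical variable that must hold the webhook URL.
notify_slack() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set" >&2
    return 1
  fi
  # -f makes curl return non-zero on HTTP errors, so failures are not silent.
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}
```

Example: notify_slack "Control Center backup uploaded" as the last line of the backup script.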

2. Make Executable

chmod +x scripts/backup-to-s3.sh

3. Schedule with Cron

# Add to crontab
crontab -e

# Add line:
0 3 * * * /Users/elliottgodwin/Desktop/egi-engine/scripts/backup-to-s3.sh

Disaster Recovery

Complete Platform Restore

Scenario: Complete cluster failure; the entire platform must be restored onto a new cluster.

Prerequisites:

New Kubernetes cluster
Latest backups available
Access credentials

Alert Team: Post in Slack #alerts-critical: "Starting disaster recovery procedure"

Steps:

Restore Infrastructure

# Apply base manifests
kubectl apply -f k8s/base/

# Install Argo CD
# (See main setup docs)

Restore Database

# Wait for database pod
kubectl wait --for=condition=ready pod -l app=control-center-db -n hq --timeout=5m

# Get pod name
DB_POD=$(kubectl get pods -n hq -l app=control-center-db -o jsonpath='{.items[0].metadata.name}')

# Restore from latest backup
cat control_center_backup_latest.sql | kubectl exec -i -n hq $DB_POD -- psql -U postgres control_center

Deploy Control Center

# Sync with Argo CD
argocd app sync control-center

Verify

# Check engines
curl https://control-center.egintegrations.com/api/engines

# Check clients
curl https://control-center.egintegrations.com/api/clients

Update Monitoring

- Reconfigure Uptime Kuma monitors
- Verify Slack alerts working
- Check PostHog analytics connectivity
- Validate CrowdSec security monitoring

Post Completion: Post in Slack #alerts-critical: "Disaster recovery complete. All systems operational."
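
The curl checks in the Verify step print response bodies, which are easy to misread during an incident. As a sketch, the function below reduces each check to a pass or fail based on the HTTP status code. The function name verify_endpoints is hypothetical, and treating HTTP 200 as success is an assumption about these endpoints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every given URL answers with HTTP 200.
# Usage: verify_endpoints <url> [<url> ...]
verify_endpoints() {
  local url code failed=0
  for url in "$@"; do
    # -s hides progress output, -o discards the body, -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    if [ "$code" = "200" ]; then
      echo "OK   $url"
    else
      echo "FAIL $url (HTTP ${code:-none})"
      failed=1
    fi
  done
  return "$failed"
}
```

Example: verify_endpoints https://control-center.egintegrations.com/api/engines https://control-center.egintegrations.com/api/clients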

Backup Retention Policy

Recommended Schedule:

Hourly: Keep last 24
Daily: Keep last 30 days
Weekly: Keep last 12 weeks
Monthly: Keep last 12 months
Yearly: Keep indefinitely

Implement with Script:

#!/bin/bash
# backup-with-retention.sh

BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create backup
pg_dump ... > $BACKUP_DIR/hourly/backup_$TIMESTAMP.sql

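# Retention cleanup
# Note: find rounds -mtime ages down to whole days, so "-mtime +1" only matches
# files at least 2 days old; -mmin gives an exact 24-hour cutoff.
find "$BACKUP_DIR/hourly" -type f -mmin +1440 -delete    # > 24 hours
find "$BACKUP_DIR/daily" -type f -mtime +30 -delete      # > 30 days
find "$BACKUP_DIR/weekly" -type f -mtime +84 -delete     # > 12 weeks
find "$BACKUP_DIR/monthly" -type f -mtime +365 -delete   # > 12 months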
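
The script writes only to the hourly directory, so the daily, weekly, and monthly tiers stay empty unless something copies backups into them. One way to do that is sketched below, under these assumptions: the midnight backup is the one promoted, weekly copies are taken on Sundays, and monthly copies on the first of the month. The function name promote_backup is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a fresh hourly backup into the longer-lived tiers.
# Usage: promote_backup <backup-file> <backup-dir> [hour] [day-of-week] [day-of-month]
# The date parts default to the current time; passing them explicitly eases testing.
promote_backup() {
  local file="$1" dir="$2"
  local hour="${3:-$(date +%H)}" dow="${4:-$(date +%u)}" dom="${5:-$(date +%d)}"
  # Only the midnight (hour 00) backup is promoted.
  [ "$hour" = "00" ] || return 0
  mkdir -p "$dir/daily" "$dir/weekly" "$dir/monthly" || return 1
  # Every midnight backup becomes the daily copy.
  cp "$file" "$dir/daily/" || return 1
  # date +%u prints 7 for Sunday.
  if [ "$dow" = "7" ]; then
    cp "$file" "$dir/weekly/" || return 1
  fi
  # First day of the month.
  if [ "$dom" = "01" ]; then
    cp "$file" "$dir/monthly/" || return 1
  fi
}
```

Example: call promote_backup "$BACKUP_DIR/hourly/backup_$TIMESTAMP.sql" "$BACKUP_DIR" right after the pg_dump line. The yearly tier is kept indefinitely and is left out of this sketch.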

Monitoring Backups

Verify Backup Success

# Check last backup
ls -lht control_center_backup_*.sql | head -1

# Verify backup is not suspiciously small (fewer than 100 lines)
if [ "$(wc -l < control_center_backup_latest.sql)" -lt 100 ]; then
  echo "WARNING: Backup seems too small!"
  # Post alert to Slack #alerts-warning
fi

Uptime Kuma Backup Monitoring

Create a monitor in Uptime Kuma:

Monitor Type: "Keyword"
URL: Backup status endpoint (if available)
Keyword: "success"
Alert if keyword not found
Notification: Slack #alerts-critical
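
A keyword monitor needs a URL whose body contains "success" only when the last backup worked. If no status endpoint exists yet, one option is a small status file that the backup job rewrites after every run. How that file is served over HTTP is out of scope here and depends on your setup. The function name write_backup_status and the file format are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record the outcome of the last backup in a text file.
# Usage: write_backup_status <status-file> <exit-code-of-backup>
write_backup_status() {
  local status_file="$1" exit_code="$2" word="failure"
  if [ "$exit_code" -eq 0 ]; then
    word="success"
  fi
  # Write to a temp file first, then rename, so readers never see a partial file.
  local tmp="${status_file}.tmp.$$"
  printf '%s %s\n' "$word" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$tmp" || return 1
  mv "$tmp" "$status_file"
}
```

Example: run the backup, then call write_backup_status /backup/status.txt "$?" (the path is made up). The file then holds one line starting with either success or failure, followed by a UTC timestamp.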

Alert on Backup Failure

Slack Alert Configuration:

Set up alerts for:

Backup job failed → Slack #alerts-critical
No backup in 24 hours → Slack #alerts-warning
Backup size suspiciously small → Slack #alerts-warning
S3 upload failed → Slack #alerts-critical
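
The "no backup in 24 hours" and "suspiciously small" conditions can be checked against the backup file itself. The sketch below prints which alert applies and returns a matching exit code. The function name check_backup, the 1024-byte minimum, and the exit codes (0 ok, 1 warning, 2 critical) are assumptions; routing the result to Slack is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the alert (if any) that a backup file should raise.
# Usage: check_backup <file> [max-age-minutes] [min-bytes]
# Exit codes: 0 = ok, 1 = warning, 2 = critical.
check_backup() {
  local file="$1" max_age_min="${2:-1440}" min_bytes="${3:-1024}"
  if [ ! -f "$file" ]; then
    echo "critical: backup file not found"
    return 2
  fi
  # find prints the file only if it was modified within the last max_age_min minutes.
  if [ -z "$(find "$file" -mmin "-$max_age_min")" ]; then
    echo "warning: no backup in the last $max_age_min minutes"
    return 1
  fi
  # wc -c reports the size in bytes; tr strips padding that some platforms add.
  local size
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "warning: backup is only $size bytes"
    return 1
  fi
  echo "ok"
}
```

Example: check_backup control_center_backup_latest.sql checks the file with the default thresholds of 1440 minutes and 1024 bytes.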

Testing Restores

Regular Restore Tests

Monthly Test (First Sunday):

Create test namespace
```
kubectl create namespace test-restore
```

Restore to test namespace

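# Deploy test database
kubectl apply -f k8s/test/postgres.yaml -n test-restore

# Get the test pod name (assumes the test manifest labels the pod app=postgres)
TEST_DB_POD=$(kubectl get pods -n test-restore -l app=postgres -o jsonpath='{.items[0].metadata.name}')

# Restore backup (test_db must already exist in the test instance)
cat control_center_backup_latest.sql | \
  kubectl exec -i -n test-restore $TEST_DB_POD -- \
  psql -U postgres test_db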

Verify data integrity

# Check row counts
kubectl exec -n test-restore $TEST_DB_POD -- psql -U postgres test_db -c "
  SELECT
    'engines' as table_name, COUNT(*) as rows FROM engines
    UNION ALL
    SELECT 'clients', COUNT(*) FROM clients
    UNION ALL
    SELECT 'client_invoices', COUNT(*) FROM client_invoices;
"

Clean up
```
kubectl delete namespace test-restore
```

Document test

Post in Slack #alerts-info: "Monthly backup restore test successful"
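
Row counts only prove something when compared against a known-good reference. As a sketch, if the counts from production and from the restored copy are each saved as "table|rows" lines (the format psql -At prints), the function below reports whether they match. The function name compare_row_counts and the idea of saving counts to files are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare two "table|rows" listings, ignoring line order.
# Usage: compare_row_counts <expected-file> <actual-file>
compare_row_counts() {
  local expected="$1" actual="$2"
  local tmp_dir
  tmp_dir=$(mktemp -d) || return 1
  # Sort both listings so the comparison does not depend on row order.
  sort "$expected" > "$tmp_dir/expected"
  sort "$actual" > "$tmp_dir/actual"
  local rc=0
  if diff "$tmp_dir/expected" "$tmp_dir/actual" > "$tmp_dir/diff"; then
    echo "row counts match"
  else
    echo "row counts differ:"
    cat "$tmp_dir/diff"
    rc=1
  fi
  rm -rf "$tmp_dir"
  return "$rc"
}
```

Counts taken at different times can differ legitimately when production has changed since the backup, so treat a mismatch as a prompt to look closer and not as proof of a bad restore.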

Backup Checklist

Before Major Changes:

Create manual backup
Verify backup completed successfully
Test restore in test environment
Document backup location
Post in Slack #alerts-info with backup location
Proceed with changes

After Incident:

Restore from backup
Verify data integrity
Check all services running via Uptime Kuma
Verify PostHog analytics working
Check CrowdSec security monitoring
Document what was restored
Post incident summary in Slack #alerts-critical
Update incident report

Emergency Contacts

If backup fails:

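Check node CPU and memory: kubectl top nodes (this does not report disk usage; for disk space, run kubectl exec -n hq $DB_POD -- df -h)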
Check pod logs: kubectl logs -n hq $DB_POD
Check CronJob status: kubectl get cronjobs -n hq
Manual backup: Follow manual procedure above
Post in Slack #alerts-critical immediately

If restore fails:

Check backup file integrity
Verify database is accessible
Check for version mismatches
Try older backup
Post in Slack #alerts-critical with details
Contact database administrator
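
Two of the checks above can be scripted. gzip -t tests a compressed backup without extracting it, and plain-text pg_dump files record the source server version in a header comment ("Dumped from database version ..."), which helps spot a version mismatch. Both function names below are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for triaging a failed restore.

# Succeed if a .gz backup decompresses cleanly (gzip -t tests without extracting).
# Usage: gzip_is_intact <file.gz>
gzip_is_intact() {
  gzip -t "$1" 2> /dev/null
}

# Print the server version recorded in a plain-text pg_dump header.
# Usage: dump_server_version <dump-file>
dump_server_version() {
  # pg_dump writes a header comment: "-- Dumped from database version <version>"
  sed -n 's/^-- Dumped from database version \([0-9][0-9.]*\).*/\1/p' "$1" | head -n 1
}
```

Compare the printed version with the output of psql -U postgres -c "SHOW server_version;" on the target database. Restoring a dump from a newer major version into an older server is a common cause of restore errors.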

Monitoring Integration

Uptime Kuma

Monitor: Backup job completion endpoint
Alert: Failed backup job
Notification: Slack #alerts-critical

PostHog

Track: Backup events (success/failure)
Dashboard: Backup reliability metrics
Alert: Backup pattern anomalies

CrowdSec

Monitor: Unauthorized access to backup storage
Alert: Suspicious backup file access
Action: Auto-ban suspicious IPs

Last Updated: 2026-03-25
Version: 1.1
