Runbook: Engine Down / Unhealthy

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Severity: HIGH
Response Time: 15 minutes
Owner: Platform Team

Symptoms

Engine showing red status in Control Center
Uptime Kuma monitoring showing service down
Alert in Slack #alerts-critical channel: "Engine {engine_id} is down"
Health check endpoint returning 503 or timing out
PostHog analytics showing no activity from engine

Initial Response (5 minutes)

1. Verify the Issue

# Check engine status in Control Center
curl https://control-center.egintegrations.com/api/engines/{engine_id}

# Check pod status
kubectl get pods -n {namespace}

# Expected: the engine pod is listed with its READY count and STATUS

2. Check Pod State

kubectl get pods -n {namespace} -l app={engine_name}

Possible States:

Running but unhealthy → Go to "Health Check Failing"
Pending → Go to "Pod Pending"
CrashLoopBackOff → Go to "Application Crash"
ImagePullBackOff → Go to "Image Pull Failure"
Error / Completed → Go to "Pod Failed"
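
The state-to-section mapping above can be sketched as a small shell helper. This is only an illustration: the function name `triage_section` is hypothetical, it takes the STATUS value as printed by `kubectl get pods`, and it does not contact the cluster.

```shell
# Map a pod STATUS value to the runbook section to follow.
# triage_section is a hypothetical helper name, not part of any tool.
triage_section() {
  case "$1" in
    Running)                       echo "Health Check Failing" ;;  # only relevant when READY is 0/1
    Pending)                       echo "Pod Pending" ;;
    CrashLoopBackOff)              echo "Application Crash" ;;
    ImagePullBackOff|ErrImagePull) echo "Image Pull Failure" ;;
    Error|Completed)               echo "Pod Failed" ;;
    *)                             echo "Unknown state: $1" ;;
  esac
}

# Example: triage_section CrashLoopBackOff
```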

Health Check Failing (Pod Running but Unhealthy)

Symptoms

Pod status: Running
Readiness: 0/1
Health endpoint returning errors

Investigation

# Check recent logs
kubectl logs -n {namespace} deployment/{engine_name} --tail=100

# Check live logs
kubectl logs -f -n {namespace} deployment/{engine_name}

# Test health endpoint directly
# (port-forward blocks; run it in a separate terminal or background it with &)
kubectl port-forward -n {namespace} svc/{engine_name} 8080:80
curl http://localhost:8080/.well-known/engine-status
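
To turn the health endpoint response into a quick verdict, a sketch along these lines may help. It assumes that any 2xx response means healthy, which this runbook does not state explicitly; `classify_health_code` is a hypothetical helper name. curl reports `000` as the status code when it received no HTTP response at all, for example on a timeout.

```shell
# Classify the HTTP status code returned by the engine health endpoint.
# Assumption: 2xx means healthy. 000 is what curl prints for %{http_code}
# when no HTTP response was received (timeout, connection refused).
classify_health_code() {
  case "$1" in
    2??) echo "healthy" ;;
    000) echo "unreachable" ;;
    *)   echo "unhealthy" ;;
  esac
}

# Usage, with the port-forward from this section active:
#   code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' \
#     http://localhost:8080/.well-known/engine-status)
#   classify_health_code "$code"
```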

Common Causes & Fixes

1. Database Connection Failure

# Check database pod
kubectl get pods -n {namespace} | grep -i postgres

# Check database logs
kubectl logs -n {namespace} {db_pod_name}

# Fix: Restart database
kubectl rollout restart statefulset/{db_name} -n {namespace}

2. Out of Memory

# Check memory usage
kubectl top pod -n {namespace} {pod_name}

# Fix: Increase memory limit
# Edit Helm values:
resources:
  limits:
    memory: 2Gi  # Increase from 1Gi

# Apply
helm upgrade {release_name} ./chart -n {namespace}

3. Dependency Service Down

# Check all services in namespace
kubectl get svc -n {namespace}

# Test service connectivity from pod
kubectl exec -it -n {namespace} {pod_name} -- curl http://{service_name}

# Fix: Restart dependency
kubectl rollout restart deployment/{dependency_name} -n {namespace}

Pod Pending

Symptoms

Pod status: Pending
Not scheduled to any node

Investigation

# Describe pod for scheduling errors
kubectl describe pod -n {namespace} {pod_name}

# Check for events at bottom of output
# Look for: "FailedScheduling", "Insufficient cpu", "Insufficient memory"
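
Scanning the describe output by eye is error-prone during an incident. The following sketch filters it down to the scheduling messages named above; `scheduling_failures` is a hypothetical helper that reads the describe output on stdin.

```shell
# Keep only the lines that mention the scheduling failures listed above.
# Reads `kubectl describe pod` output on stdin. The `|| true` keeps the
# exit status at 0 when nothing matches.
scheduling_failures() {
  grep -E 'FailedScheduling|Insufficient (cpu|memory)' || true
}

# Usage: pipe the output of the describe command above into scheduling_failures
```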

Common Causes & Fixes

1. Insufficient Resources

# Check node resources
kubectl top nodes

# Check resource requests
kubectl describe pod -n {namespace} {pod_name} | grep -A 5 "Requests:"

# Fix Option A: Reduce resource requests
resources:
  requests:
    cpu: 100m      # Reduce from 500m
    memory: 256Mi  # Reduce from 1Gi

# Fix Option B: Add more nodes
# Go to DigitalOcean dashboard
# Resize node pool or add nodes

2. No Node Matches Selector

# Check node selector
kubectl get pod -n {namespace} {pod_name} -o yaml | grep -A 3 nodeSelector

# Fix: Remove restrictive node selector
# Or ensure nodes have required labels

Application Crash (CrashLoopBackOff)

Symptoms

Pod status: CrashLoopBackOff
Container repeatedly restarting

Investigation

# Get logs from the previous (crashed) container instance
kubectl logs -n {namespace} {pod_name} --previous

# Check restart count
kubectl get pod -n {namespace} {pod_name}

# Describe pod for exit code
kubectl describe pod -n {namespace} {pod_name} | grep "Exit Code"

Common Exit Codes

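Exit Code 137: Killed by SIGKILL (128 + 9), most often Out of Memory (OOMKilled)
Exit Code 1: Application error
Exit Code 2: Often misconfiguration or invalid command usage (application-defined)
Exit Code 139: Segmentation fault (SIGSEGV, 128 + 11)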
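
By convention, an exit code above 128 usually means the process was terminated by a signal whose number is the code minus 128. A minimal sketch of that arithmetic, using a hypothetical helper name `describe_exit_code`:

```shell
# Decode a container exit code using the 128 + signal convention.
# describe_exit_code is a hypothetical helper name.
describe_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "exited normally"
  else
    echo "application exit status $code"
  fi
}

# Example: describe_exit_code 137
```

Signal 9 is SIGKILL and signal 11 is SIGSEGV, which matches the 137 and 139 entries in the list above.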

Fixes by Exit Code

Exit Code 137 (OOMKilled):

# Increase memory
resources:
  limits:
    memory: 2Gi  # Double current limit

helm upgrade {release_name} ./chart -n {namespace}
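
The "double current limit" step can be computed rather than done by hand. This sketch handles only whole-number values with a `Mi` or `Gi` suffix; `double_memory` is a hypothetical helper name.

```shell
# Double a Kubernetes memory quantity such as 512Mi or 1Gi.
# Only integer values with a Mi or Gi suffix are handled.
double_memory() {
  value="${1%[MG]i}"     # numeric part, e.g. 512
  unit="${1#"$value"}"   # unit suffix, e.g. Mi
  echo "$((value * 2))$unit"
}

# Example: double_memory 1Gi
```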

Exit Code 1 (Application Error):

# Check logs for stack trace
kubectl logs -n {namespace} {pod_name} --previous | tail -50

# Common issues:
# - Missing environment variables
# - Invalid configuration
# - Database connection failure

# Fix: Update environment variables or config
kubectl edit deployment/{engine_name} -n {namespace}

Image Pull Failure

Symptoms

Pod status: ImagePullBackOff or ErrImagePull
Cannot pull Docker image

Investigation

# Check image name
kubectl get pod -n {namespace} {pod_name} -o jsonpath='{.spec.containers[0].image}'

# Check pull errors
kubectl describe pod -n {namespace} {pod_name} | grep -A 5 "Failed to pull image"

Common Causes & Fixes

1. Image Doesn't Exist

# Verify image exists in GHCR
docker pull ghcr.io/{org}/{image}:{tag}

# Fix: Build and push image
cd /path/to/repo
docker build -t ghcr.io/{org}/{image}:{tag} .
docker push ghcr.io/{org}/{image}:{tag}

2. Private Registry Authentication

# Check image pull secret exists
kubectl get secrets -n {namespace} | grep regcred

# Fix: Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username={github_username} \
  --docker-password={github_token} \
  -n {namespace}

# Attach the secret to the default service account so that new pods
# in the namespace use it (existing pods must be recreated)
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}' \
  -n {namespace}

3. Wrong Image Tag

# Check available tags
# Go to: https://github.com/orgs/{org}/packages/{package}

# Fix: Update to correct tag
helm upgrade {release_name} ./chart \
  --set image.tag={correct_tag} \
  -n {namespace}

Pod Failed

Symptoms

Pod status: Error or Completed
Pod exited and didn't restart

Investigation

# Get pod logs
kubectl logs -n {namespace} {pod_name}

# Check pod events
kubectl describe pod -n {namespace} {pod_name}

Fix

# Delete pod to trigger recreation
kubectl delete pod -n {namespace} {pod_name}

# Or restart deployment
kubectl rollout restart deployment/{engine_name} -n {namespace}

Quick Fixes (Try These First)

1. Restart Pod

kubectl rollout restart deployment/{engine_name} -n {namespace}

# Wait for rollout
kubectl rollout status deployment/{engine_name} -n {namespace}

2. Force Pull Latest Image

# Delete pod to force a fresh pull
# (a re-pull is only guaranteed if imagePullPolicy is Always)
kubectl delete pod -n {namespace} {pod_name}

# Or point the deployment at an explicit image tag
kubectl set image deployment/{engine_name} \
  {container_name}=ghcr.io/{org}/{image}:{tag} \
  -n {namespace}

3. Scale Down and Up

# Scale to 0
kubectl scale deployment/{engine_name} --replicas=0 -n {namespace}

# Wait 10 seconds
sleep 10

# Scale back up (restore the previous replica count if it was not 1)
kubectl scale deployment/{engine_name} --replicas=1 -n {namespace}
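
The steps above assume the deployment normally runs a single replica. A sketch that records the current replica count first and restores it afterwards could look like this. `bounce_deployment` and the `BOUNCE_WAIT` variable are hypothetical names introduced for illustration.

```shell
# Scale a deployment to zero and back to its previous replica count.
# Arguments: deployment name, namespace.
# Note: if the deployment was already at 0 replicas, it stays at 0.
bounce_deployment() {
  deployment="$1"
  namespace="$2"
  # Record the desired replica count before touching anything
  replicas=$(kubectl get "deployment/$deployment" -n "$namespace" \
    -o jsonpath='{.spec.replicas}')
  kubectl scale "deployment/$deployment" --replicas=0 -n "$namespace"
  sleep "${BOUNCE_WAIT:-10}"   # default wait of 10 seconds, as above
  kubectl scale "deployment/$deployment" --replicas="$replicas" -n "$namespace"
}

# Example: bounce_deployment my-engine my-namespace
```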

Prevention

Before Deploying

Test Locally

docker run -p 8000:8000 ghcr.io/{org}/{image}:{tag}
curl http://localhost:8000/.well-known/engine-status

Check Resource Limits
- Ensure limits match application needs
- Test with load

Verify Image Exists

docker pull ghcr.io/{org}/{image}:{tag}

Monitoring

Set Up Alerts
- Configure Uptime Kuma monitors for each engine
- Alert in Slack #alerts-critical when engine down >5 minutes
- Alert on high restart count
- Monitor via PostHog analytics for activity drops

Regular Health Checks
- Monitor Control Center dashboard daily
- Review Uptime Kuma status page
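
The "high restart count" alert can be approximated from the command line. This sketch reads `kubectl get pods --no-headers` output on stdin, where the columns are NAME, READY, STATUS, RESTARTS, and AGE. The default threshold of 5 is an assumption, since no specific number is given here, and `high_restart_pods` is a hypothetical helper name.

```shell
# Print the name and restart count of pods whose RESTARTS value exceeds
# a threshold (first argument, default 5; the default is an assumption).
high_restart_pods() {
  awk -v max="${1:-5}" '$4 + 0 > max + 0 { print $1, $4 }'
}

# Usage: pipe `kubectl get pods --no-headers` output for the namespace
# into high_restart_pods, optionally passing a threshold.
```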

Escalation

If issue persists after 30 minutes:

Check Cluster Health
```
kubectl get nodes
kubectl top nodes
```

Review Recent Changes
- Check GitHub commits
- Check Argo CD sync history

Contact Support

Post in Slack #alerts-critical with details

Gather diagnostic bundle:

kubectl logs -n {namespace} deployment/{engine_name} > logs.txt
kubectl describe pod -n {namespace} {pod_name} > describe.txt
kubectl get events -n {namespace} --sort-by='.lastTimestamp' > events.txt

Post-Incident

1. Document Resolution

What caused the issue?
What fixed it?
How long was it down?
Post summary in Slack #alerts-critical

2. Update Monitoring

Add Uptime Kuma monitor if not caught
Adjust alert thresholds if needed
Configure CrowdSec rules if security-related

3. Prevent Recurrence

Fix root cause
Add automated checks
Update this runbook

Checklist

Last Updated: 2026-03-25
Version: 1.1
