
Runbook: Engine Down / Unhealthy

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Severity: HIGH
Response Time: 15 minutes
Owner: Platform Team


Symptoms

  • Engine showing red status in Control Center
  • Uptime Kuma monitoring showing service down
  • Alert in Slack #alerts-critical channel: "Engine {engine_id} is down"
  • Health check endpoint returning 503 or timing out
  • PostHog analytics showing no activity from engine

Initial Response (5 minutes)

1. Verify the Issue

# Check engine status in Control Center
curl https://control-center.egintegrations.com/api/engines/{engine_id}

# Check pod status
kubectl get pods -n {namespace}

# Expected: the engine's pod is listed; note its STATUS and READY columns

2. Check Pod State

kubectl get pods -n {namespace} -l app={engine_name}

Possible States:

  • Running but unhealthy → Go to "Health Check Failing"
  • Pending → Go to "Pod Pending"
  • CrashLoopBackOff → Go to "Application Crash"
  • ImagePullBackOff → Go to "Image Pull Failure"
  • Error / Completed → Go to "Pod Failed"
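The routing above can be sketched as a small shell helper. This is only an illustration: the function name `runbook_section_for_status` is hypothetical, and `Running` maps to "Health Check Failing" only when the pod is also not ready.

```shell
# Map a pod STATUS string (as printed by `kubectl get pods`) to the
# runbook section that covers it. The function name is hypothetical.
runbook_section_for_status() {
    case "$1" in
        Running)                       echo "Health Check Failing" ;;  # only if READY is 0/1
        Pending)                       echo "Pod Pending" ;;
        CrashLoopBackOff)              echo "Application Crash" ;;
        ImagePullBackOff|ErrImagePull) echo "Image Pull Failure" ;;
        Error|Completed)               echo "Pod Failed" ;;
        *)                             echo "Unknown status: $1" ;;
    esac
}

# Example (requires cluster access; NAMESPACE and ENGINE_NAME are assumed variables):
#   kubectl get pods -n "$NAMESPACE" -l app="$ENGINE_NAME" --no-headers |
#     while read -r name ready status rest; do
#       echo "$name ($ready): $(runbook_section_for_status "$status")"
#     done
```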

Health Check Failing (Pod Running but Unhealthy)

Symptoms

  • Pod status: Running
  • Readiness: 0/1
  • Health endpoint returning errors

Investigation

# Check recent logs
kubectl logs -n {namespace} deployment/{engine_name} --tail=100

# Check live logs
kubectl logs -f -n {namespace} deployment/{engine_name}

# Test health endpoint directly
# (port-forward blocks the terminal; run curl from a second terminal)
kubectl port-forward -n {namespace} svc/{engine_name} 8080:80
curl http://localhost:8080/.well-known/engine-status
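To tell a 503 apart from a timeout, as listed in the symptoms, it can help to print only the HTTP status code. A minimal sketch, assuming the port-forward is already running; the function names and the 5-second timeout are illustrative choices, not part of the original runbook.

```shell
# Print the HTTP status code returned by a URL. curl prints 000 when the
# request fails or times out. The 5-second limit is an assumed value.
health_status_code() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# Succeed only when the endpoint answers 200.
is_healthy() {
    [ "$(health_status_code "$1")" = "200" ]
}

# Example:
#   health_status_code http://localhost:8080/.well-known/engine-status
#   is_healthy http://localhost:8080/.well-known/engine-status && echo "healthy"
```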

Common Causes & Fixes

1. Database Connection Failure

# Check database pod
kubectl get pods -n {namespace} | grep -i postgres

# Check database logs
kubectl logs -n {namespace} {db_pod_name}

# Fix: Restart database
kubectl rollout restart statefulset/{db_name} -n {namespace}

2. Out of Memory

# Check memory usage
kubectl top pod -n {namespace} {pod_name}

# Fix: Increase memory limit
# Edit Helm values:
resources:
  limits:
    memory: 2Gi # Increase from 1Gi

# Apply
helm upgrade {release_name} ./chart -n {namespace}
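Before raising the limit, it may be worth confirming that the container was actually OOM-killed and checking its current limit. A sketch under the assumption that the engine is the first container in the pod (index 0); the function names are hypothetical.

```shell
# Print why the container last terminated (for example OOMKilled or Error).
# Assumes the engine is the first container in the pod.
last_termination_reason() {
    kubectl get pod -n "$1" "$2" \
        -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# Print the memory limit currently set on the first container.
container_memory_limit() {
    kubectl get pod -n "$1" "$2" \
        -o jsonpath='{.spec.containers[0].resources.limits.memory}'
}

# Example:
#   last_termination_reason {namespace} {pod_name}
#   container_memory_limit {namespace} {pod_name}
```

An empty result from the first function means the container has not terminated since the pod started.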

3. Dependency Service Down

# Check all services in namespace
kubectl get svc -n {namespace}

# Test service connectivity from pod
kubectl exec -it -n {namespace} {pod_name} -- curl http://{service_name}

# Fix: Restart dependency
kubectl rollout restart deployment/{dependency_name} -n {namespace}

Pod Pending

Symptoms

  • Pod status: Pending
  • Not scheduled to any node

Investigation

# Describe pod for scheduling errors
kubectl describe pod -n {namespace} {pod_name}

# Check for events at bottom of output
# Look for: "FailedScheduling", "Insufficient cpu", "Insufficient memory"
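Instead of scanning the full describe output, the scheduling events can be queried directly with a field selector. A minimal sketch; the function name is hypothetical.

```shell
# List only FailedScheduling events for one pod, oldest first.
scheduling_failures() {
    kubectl get events -n "$1" \
        --field-selector "involvedObject.name=$2,reason=FailedScheduling" \
        --sort-by=.lastTimestamp
}

# Example:
#   scheduling_failures {namespace} {pod_name}
```

The event message normally states the reason, such as "Insufficient cpu" or "Insufficient memory".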

Common Causes & Fixes

1. Insufficient Resources

# Check node resources
kubectl top nodes

# Check resource requests
kubectl describe pod -n {namespace} {pod_name} | grep -A 5 "Requests:"

# Fix Option A: Reduce resource requests
resources:
  requests:
    cpu: 100m # Reduce from 500m
    memory: 256Mi # Reduce from 1Gi

# Fix Option B: Add more nodes
# Go to DigitalOcean dashboard
# Resize node pool or add nodes

2. No Node Matches Selector

# Check node selector
kubectl get pod -n {namespace} {pod_name} -o yaml | grep -A 3 nodeSelector

# Fix: Remove restrictive node selector
# Or ensure nodes have required labels
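To check whether any node carries the label a selector asks for, the matching nodes can be counted. A sketch only: the function name and the `pool=engines` label in the example are hypothetical, and errors are suppressed, so a result of 0 can also mean the cluster could not be reached.

```shell
# Count nodes that match a label selector such as "key=value".
# stderr is discarded, so connection errors also yield 0.
count_nodes_with_label() {
    kubectl get nodes -l "$1" --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Example (hypothetical label):
#   count_nodes_with_label pool=engines
```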

Application Crash (CrashLoopBackOff)

Symptoms

  • Pod status: CrashLoopBackOff
  • Container repeatedly restarting

Investigation

# Get logs from the previous (crashed) container
kubectl logs -n {namespace} {pod_name} --previous

# Check restart count
kubectl get pod -n {namespace} {pod_name}

# Describe pod for exit code
kubectl describe pod -n {namespace} {pod_name} | grep "Exit Code"

Common Exit Codes

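  • Exit Code 137: Killed by SIGKILL (128 + 9), most often Out of Memory (OOMKilled)
  • Exit Code 1: Application error
  • Exit Code 2: Misconfiguration or incorrect command usage (application-specific)
  • Exit Code 139: Segmentation fault (SIGSEGV, 128 + 11)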
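Exit codes above 128 conventionally mean the process was killed by a signal, with the signal number equal to the code minus 128. The sketch below applies that arithmetic; the function name is hypothetical, and codes of 128 and below are application-defined.

```shell
# Describe a container exit code using the 128 + signal convention.
explain_exit_code() {
    local code="$1"
    if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
    elif [ "$code" -eq 0 ]; then
        echo "exited cleanly"
    else
        echo "application exited with status $code"
    fi
}

# Example:
#   explain_exit_code 137
```

Signal 9 is SIGKILL and signal 11 is SIGSEGV, which matches the 137 and 139 entries in the list above.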

Fixes by Exit Code

Exit Code 137 (OOMKilled):

# Increase memory
resources:
  limits:
    memory: 2Gi # Double current limit

helm upgrade {release_name} ./chart -n {namespace}

Exit Code 1 (Application Error):

# Check logs for stack trace
kubectl logs -n {namespace} {pod_name} --previous | tail -50

# Common issues:
# - Missing environment variables
# - Invalid configuration
# - Database connection failure

# Fix: Update environment variables or config
kubectl edit deployment/{engine_name} -n {namespace}
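For the missing-environment-variable case, one option is to compare the deployment's variables against a list of required names. A rough sketch: the function name and the `DATABASE_URL` and `API_KEY` names are hypothetical. Variables sourced from secrets or config maps may not appear as plain `KEY=value` lines, so treat the output as a hint rather than proof.

```shell
# Read KEY=value lines on stdin and print each required name (given as
# arguments) that does not appear as a key.
missing_env_vars() {
    local env_listing name
    env_listing="$(cat)"
    for name in "$@"; do
        if ! printf '%s\n' "$env_listing" | grep -q "^${name}="; then
            echo "$name"
        fi
    done
}

# Example (hypothetical variable names):
#   kubectl set env deployment/{engine_name} -n {namespace} --list |
#     missing_env_vars DATABASE_URL API_KEY
```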

Image Pull Failure

Symptoms

  • Pod status: ImagePullBackOff or ErrImagePull
  • Cannot pull Docker image

Investigation

# Check image name
kubectl get pod -n {namespace} {pod_name} -o jsonpath='{.spec.containers[0].image}'

# Check pull errors
kubectl describe pod -n {namespace} {pod_name} | grep -A 5 "Failed to pull image"

Common Causes & Fixes

1. Image Doesn't Exist

# Verify image exists in GHCR
docker pull ghcr.io/{org}/{image}:{tag}

# Fix: Build and push image
cd /path/to/repo
docker build -t ghcr.io/{org}/{image}:{tag} .
docker push ghcr.io/{org}/{image}:{tag}

2. Private Registry Authentication

# Check image pull secret exists
kubectl get secrets -n {namespace} | grep regcred

# Fix: Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username={github_username} \
  --docker-password={github_token} \
  -n {namespace}

# Attach the secret to the default service account
# (pods using that service account pick it up when they are recreated)
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}' \
  -n {namespace}
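After patching, it can be useful to confirm that the service account actually references the secret. A minimal sketch; the function name is hypothetical, and the service account defaults to `default` as in the patch command above.

```shell
# Succeed if the service account lists the given image pull secret.
# Usage: sa_has_pull_secret <namespace> <secret> [serviceaccount]
sa_has_pull_secret() {
    local names
    names="$(kubectl get serviceaccount "${3:-default}" -n "$1" \
        -o jsonpath='{.imagePullSecrets[*].name}')" || return 2
    case " $names " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example:
#   sa_has_pull_secret {namespace} regcred && echo "regcred attached"
```

Existing pods keep the pull secrets they were created with, so a restart is still needed after the check passes.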

3. Wrong Image Tag

# Check available tags
# Go to: https://github.com/orgs/{org}/packages/{package}

# Fix: Update to correct tag
helm upgrade {release_name} ./chart \
  --set image.tag={correct_tag} \
  -n {namespace}

Pod Failed

Symptoms

  • Pod status: Error or Completed
  • Pod exited and didn't restart

Investigation

# Get pod logs
kubectl logs -n {namespace} {pod_name}

# Check pod events
kubectl describe pod -n {namespace} {pod_name}

Fix

# Delete pod to trigger recreation
kubectl delete pod -n {namespace} {pod_name}

# Or restart deployment
kubectl rollout restart deployment/{engine_name} -n {namespace}

Quick Fixes (Try These First)

1. Restart Pod

kubectl rollout restart deployment/{engine_name} -n {namespace}

# Wait for rollout
kubectl rollout status deployment/{engine_name} -n {namespace}
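Because `kubectl rollout status` can wait a long time on a broken rollout, a bounded wait may suit the 15-minute response window better. A sketch with a hypothetical function name and an assumed default timeout of 120 seconds.

```shell
# Restart a deployment and wait for the rollout, giving up after a timeout.
# Usage: restart_and_wait <namespace> <deployment> [timeout]
restart_and_wait() {
    kubectl rollout restart "deployment/$2" -n "$1" &&
        kubectl rollout status "deployment/$2" -n "$1" --timeout="${3:-120s}"
}

# Example:
#   restart_and_wait {namespace} {engine_name} 180s || echo "rollout did not finish"
```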

2. Force Pull Latest Image

# Delete pod to force image pull
# (the image is re-pulled only if imagePullPolicy is Always;
#  with IfNotPresent a cached copy on the node is reused)
kubectl delete pod -n {namespace} {pod_name}

# Or set the image explicitly
kubectl set image deployment/{engine_name} \
  {container_name}=ghcr.io/{org}/{image}:{tag} \
  -n {namespace}

3. Scale Down and Up

# Scale to 0
kubectl scale deployment/{engine_name} --replicas=0 -n {namespace}

# Wait 10 seconds
sleep 10

# Scale back up (use the deployment's normal replica count if it is not 1)
kubectl scale deployment/{engine_name} --replicas=1 -n {namespace}
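The steps above hard-code one replica. A variant that records the current replica count first and restores it afterwards might look like the following sketch. The function name and the `BOUNCE_WAIT` variable are hypothetical, and it assumes no autoscaler is managing the deployment.

```shell
# Scale a deployment to zero and back to its previous replica count.
# Usage: bounce_deployment <namespace> <deployment>
bounce_deployment() {
    local ns="$1" deploy="$2" replicas
    replicas="$(kubectl get deployment "$deploy" -n "$ns" \
        -o jsonpath='{.spec.replicas}')" || return 1
    [ -n "$replicas" ] || return 1
    kubectl scale "deployment/$deploy" --replicas=0 -n "$ns" || return 1
    sleep "${BOUNCE_WAIT:-10}"   # same 10-second pause as the manual steps
    kubectl scale "deployment/$deploy" --replicas="$replicas" -n "$ns"
}

# Example:
#   bounce_deployment {namespace} {engine_name}
```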

Prevention

Before Deploying

  1. Test Locally

    docker run -p 8000:8000 ghcr.io/{org}/{image}:{tag}
    curl http://localhost:8000/.well-known/engine-status
  2. Check Resource Limits

    • Ensure limits match application needs
    • Test with load
  3. Verify Image Exists

    docker pull ghcr.io/{org}/{image}:{tag}

Monitoring

  1. Set Up Alerts

    • Configure Uptime Kuma monitors for each engine
    • Alert in Slack #alerts-critical when engine down >5 minutes
    • Alert on high restart count
    • Monitor via PostHog analytics for activity drops
  2. Regular Health Checks

    • Monitor Control Center dashboard daily
    • Review Uptime Kuma status page
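The "high restart count" alert can be approximated by hand with a threshold check. A sketch assuming the engine is the first container in the pod; the function names are hypothetical and the threshold is whatever value the team chooses.

```shell
# Print the restart count of the first container in a pod.
restart_count() {
    kubectl get pod -n "$1" "$2" \
        -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# Succeed when the restart count is greater than the threshold.
# Usage: restarts_exceed <namespace> <pod> <threshold>
restarts_exceed() {
    local count
    count="$(restart_count "$1" "$2")" || return 2
    [ "${count:-0}" -gt "$3" ]
}

# Example (threshold of 5 is illustrative):
#   restarts_exceed {namespace} {pod_name} 5 && echo "restart count is high"
```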

Escalation

If issue persists after 30 minutes:

  1. Check Cluster Health

    kubectl get nodes
    kubectl top nodes
  2. Review Recent Changes

    • Check GitHub commits
    • Check Argo CD sync history
  3. Contact Support

    • Post in Slack #alerts-critical with details
    • Gather diagnostic bundle:
      kubectl logs -n {namespace} deployment/{engine_name} > logs.txt
      kubectl describe pod -n {namespace} {pod_name} > describe.txt
      kubectl get events -n {namespace} --sort-by='.lastTimestamp' > events.txt
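The diagnostic bundle above can be gathered in one step so that nothing is forgotten under pressure. A sketch with a hypothetical function name and directory naming scheme. It also saves the previous container's logs, which is an addition to the three files listed above.

```shell
# Collect logs, pod description and events into a timestamped directory
# and print the directory name.
# Usage: collect_diagnostics <namespace> <engine_name> <pod_name>
collect_diagnostics() {
    local ns="$1" engine="$2" pod="$3" dir
    dir="diag-${engine}-$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$dir" || return 1
    kubectl logs -n "$ns" "deployment/$engine" > "$dir/logs.txt" 2>&1
    kubectl logs -n "$ns" "$pod" --previous > "$dir/logs-previous.txt" 2>&1
    kubectl describe pod -n "$ns" "$pod" > "$dir/describe.txt" 2>&1
    kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$dir/events.txt" 2>&1
    echo "$dir"
}

# Example:
#   bundle="$(collect_diagnostics {namespace} {engine_name} {pod_name})"
#   tar czf "$bundle.tar.gz" "$bundle"
```

Errors from individual commands are written into the corresponding file rather than aborting the collection, so a missing previous container does not stop the rest.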

Post-Incident

1. Document Resolution

  • What caused the issue?
  • What fixed it?
  • How long was it down?
  • Post summary in Slack #alerts-critical

2. Update Monitoring

  • Add Uptime Kuma monitor if not caught
  • Adjust alert thresholds if needed
  • Configure CrowdSec rules if security-related

3. Prevent Recurrence

  • Fix root cause
  • Add automated checks
  • Update this runbook

Checklist

  • Verified pod status
  • Checked pod logs
  • Attempted restart
  • Verified fix (engine shows green in Uptime Kuma & Control Center)
  • Updated incident log
  • Posted resolution in Slack
  • Reviewed prevention steps

Last Updated: 2026-03-25
Version: 1.1