Runbook: Engine Down / Unhealthy
Historical Runbook
This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.
Severity: HIGH
Response Time: 15 minutes
Owner: Platform Team
Symptoms
- Engine showing red status in Control Center
- Uptime Kuma monitoring showing service down
- Alert in Slack #alerts-critical channel: "Engine {engine_id} is down"
- Health check endpoint returning 503 or timing out
- PostHog analytics showing no activity from engine
Initial Response (5 minutes)
1. Verify the Issue
# Check engine status in Control Center
curl https://control-center.egintegrations.com/api/engines/{engine_id}
# Check pod status
kubectl get pods -n {namespace}
# Expected: one row per pod, with its state in the STATUS column
2. Check Pod State
kubectl get pods -n {namespace} -l app={engine_name}
Possible States:
- Running but unhealthy → Go to "Health Check Failing"
- Pending → Go to "Pod Pending"
- CrashLoopBackOff → Go to "Application Crash"
- ImagePullBackOff → Go to "Image Pull Failure"
- Error / Completed → Go to "Pod Failed"
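When triage is scripted, the state-to-section mapping above can be written as a small lookup. This is a sketch only: `runbook_section` is a hypothetical helper name, and it assumes the argument is the value shown in the STATUS column of `kubectl get pods`.

```shell
# Map a pod STATUS value to the runbook section to follow.
# runbook_section is an illustrative helper, not an existing tool.
runbook_section() {
  case "$1" in
    # Running only leads here when readiness is failing (for example 0/1).
    Running)                       echo "Health Check Failing" ;;
    Pending)                       echo "Pod Pending" ;;
    CrashLoopBackOff)              echo "Application Crash" ;;
    ImagePullBackOff|ErrImagePull) echo "Image Pull Failure" ;;
    Error|Completed)               echo "Pod Failed" ;;
    *)                             echo "Unknown status: $1" ;;
  esac
}
```

For example, `runbook_section CrashLoopBackOff` prints `Application Crash`.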
Health Check Failing (Pod Running but Unhealthy)
Symptoms
- Pod status: Running
- Readiness: 0/1
- Health endpoint returning errors
Investigation
# Check recent logs
kubectl logs -n {namespace} deployment/{engine_name} --tail=100
# Check live logs
kubectl logs -f -n {namespace} deployment/{engine_name}
# Test health endpoint directly (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n {namespace} svc/{engine_name} 8080:80
curl http://localhost:8080/.well-known/engine-status
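To tell a 503 apart from a timeout at a glance, the health probe can be reduced to its HTTP status code. The sketch below assumes the port-forward from this section is already running; the helper names `classify_health` and `check_engine_health` are hypothetical, and the 5-second limit is an arbitrary choice.

```shell
# classify_health maps an HTTP status code to a short verdict.
# curl reports 000 when it received no HTTP response at all.
classify_health() {
  case "$1" in
    2??) echo "healthy" ;;
    503) echo "unavailable" ;;
    000) echo "timeout-or-unreachable" ;;
    *)   echo "unexpected:$1" ;;
  esac
}

# check_engine_health probes the endpoint and prints the verdict.
# -o /dev/null discards the body, -w '%{http_code}' prints only the status
# code, and --max-time bounds the total wait.
check_engine_health() {
  url="${1:-http://localhost:8080/.well-known/engine-status}"
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  classify_health "$code"
}
```

Calling `check_engine_health` with no arguments probes the port-forwarded URL used above; a different URL can be passed as the first argument.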
Common Causes & Fixes
1. Database Connection Failure
# Check database pod
kubectl get pods -n {namespace} | grep -i postgres
# Check database logs
kubectl logs -n {namespace} {db_pod_name}
# Fix: Restart database
kubectl rollout restart statefulset/{db_name} -n {namespace}
2. Out of Memory
# Check memory usage
kubectl top pod -n {namespace} {pod_name}
# Fix: Increase memory limit
# Edit Helm values:
resources:
  limits:
    memory: 2Gi # Increase from 1Gi
# Apply
helm upgrade {release_name} ./chart -n {namespace}
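To judge how close a pod is to its limit, the usage reported by `kubectl top pod` can be compared with the configured limit. The following is a minimal sketch with hypothetical helper names; it assumes whole-number quantities with a `Mi` or `Gi` suffix and handles nothing else.

```shell
# to_mib converts a Kubernetes memory quantity (Mi or Gi suffix only)
# into whole MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# mem_percent prints usage as an integer percentage of the limit
# (integer division, so the result is rounded down).
mem_percent() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}
```

For example, `mem_percent 950Mi 1Gi` prints `92`, which suggests the 1Gi limit leaves little headroom.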
3. Dependency Service Down
# Check all services in namespace
kubectl get svc -n {namespace}
# Test service connectivity from pod
kubectl exec -it -n {namespace} {pod_name} -- curl http://{service_name}
# Fix: Restart dependency
kubectl rollout restart deployment/{dependency_name} -n {namespace}
Pod Pending
Symptoms
- Pod status: Pending
- Not scheduled to any node
Investigation
# Describe pod for scheduling errors
kubectl describe pod -n {namespace} {pod_name}
# Check for events at bottom of output
# Look for: "FailedScheduling", "Insufficient cpu", "Insufficient memory"
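The event message usually points to one of the two causes covered in this section. A rough classifier might look like the sketch below; `scheduling_cause` is a hypothetical helper, and the selector patterns are assumptions because scheduler message wording differs between Kubernetes versions.

```shell
# scheduling_cause reads a FailedScheduling message and names the matching
# cause from this runbook. The match patterns are illustrative only.
scheduling_cause() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "Insufficient Resources" ;;
    *"node affinity/selector"*|*"node selector"*)
      echo "No Node Matches Selector" ;;
    *)
      echo "Other: read the full event message" ;;
  esac
}
```

The message can be copied from the Events section of `kubectl describe pod`, or listed with `kubectl get events -n {namespace} --field-selector reason=FailedScheduling`.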
Common Causes & Fixes
1. Insufficient Resources
# Check node resources
kubectl top nodes
# Check resource requests
kubectl describe pod -n {namespace} {pod_name} | grep -A 5 "Requests:"
# Fix Option A: Reduce resource requests
resources:
  requests:
    cpu: 100m # Reduce from 500m
    memory: 256Mi # Reduce from 1Gi
# Fix Option B: Add more nodes
# Go to DigitalOcean dashboard
# Resize node pool or add nodes
2. No Node Matches Selector
# Check node selector
kubectl get pod -n {namespace} {pod_name} -o yaml | grep -A 3 nodeSelector
# Fix: Remove restrictive node selector
# Or ensure nodes have required labels
Application Crash (CrashLoopBackOff)
Symptoms
- Pod status: CrashLoopBackOff
- Container repeatedly restarting
Investigation
# Get logs from the previous (crashed) container instance
kubectl logs -n {namespace} {pod_name} --previous
# Check restart count
kubectl get pod -n {namespace} {pod_name}
# Describe pod for exit code
kubectl describe pod -n {namespace} {pod_name} | grep "Exit Code"
Common Exit Codes
- Exit Code 137: Killed by SIGKILL (128 + 9), most often Out of Memory (OOMKilled)
- Exit Code 1: Application error
- Exit Code 2: Misconfiguration
- Exit Code 139: Segmentation fault (128 + 11, SIGSEGV)
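Codes above 128 follow the usual convention of 128 plus the signal number, so codes not listed above can still be decoded. A small sketch, with `describe_exit_code` as a hypothetical helper name:

```shell
# describe_exit_code explains a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
describe_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$(( code - 128 ))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL): check for OOMKilled" ;;
      11) echo "signal 11 (SIGSEGV): segmentation fault" ;;
      15) echo "signal 15 (SIGTERM)" ;;
      *)  echo "signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application exit status $code: check logs"
  fi
}
```

For example, `describe_exit_code 143` prints `signal 15 (SIGTERM)`, which points to the container being asked to stop rather than crashing on its own.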
Fixes by Exit Code
Exit Code 137 (OOMKilled):
# Increase memory
resources:
  limits:
    memory: 2Gi # Double current limit
helm upgrade {release_name} ./chart -n {namespace}
Exit Code 1 (Application Error):
# Check logs for stack trace
kubectl logs -n {namespace} {pod_name} --previous | tail -50
# Common issues:
# - Missing environment variables
# - Invalid configuration
# - Database connection failure
# Fix: Update environment variables or config
kubectl edit deployment/{engine_name} -n {namespace}
Image Pull Failure
Symptoms
- Pod status: ImagePullBackOff or ErrImagePull
- Cannot pull Docker image
Investigation
# Check image name
kubectl get pod -n {namespace} {pod_name} -o jsonpath='{.spec.containers[0].image}'
# Check pull errors
kubectl describe pod -n {namespace} {pod_name} | grep -A 5 "Failed to pull image"
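Once the image reference is known, it helps to confirm which tag is actually being requested, since a reference without a tag resolves to `latest`. A minimal sketch with a hypothetical helper, `image_tag`:

```shell
# image_tag prints the tag of an image reference, or "latest" when none is
# given. Only a colon after the last slash counts as a tag separator, so a
# registry port such as localhost:5000/app is not mistaken for a tag.
image_tag() {
  last="${1##*/}"            # final path component, e.g. "engine:v1.2.3"
  case "$last" in
    *@*) echo "(digest)" ;;  # pinned by digest, no tag to compare
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}
```

For example, `image_tag ghcr.io/acme/engine:v1.2.3` prints `v1.2.3` (the `acme/engine` name is made up for illustration).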
Common Causes & Fixes
1. Image Doesn't Exist
# Verify image exists in GHCR
docker pull ghcr.io/{org}/{image}:{tag}
# Fix: Build and push image
cd /path/to/repo
docker build -t ghcr.io/{org}/{image}:{tag} .
docker push ghcr.io/{org}/{image}:{tag}
2. Private Registry Authentication
# Check image pull secret exists
kubectl get secrets -n {namespace} | grep regcred
# Fix: Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username={github_username} \
  --docker-password={github_token} \
  -n {namespace}
# Attach the secret to the default service account so new pods in the namespace use it
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}' \
  -n {namespace}
3. Wrong Image Tag
# Check available tags
# Go to: https://github.com/orgs/{org}/packages/{package}
# Fix: Update to correct tag
helm upgrade {release_name} ./chart \
  --set image.tag={correct_tag} \
  -n {namespace}
Pod Failed
Symptoms
- Pod status: Error or Completed
- Pod exited and didn't restart
Investigation
# Get pod logs
kubectl logs -n {namespace} {pod_name}
# Check pod events
kubectl describe pod -n {namespace} {pod_name}
Fix
# Delete pod to trigger recreation
kubectl delete pod -n {namespace} {pod_name}
# Or restart deployment
kubectl rollout restart deployment/{engine_name} -n {namespace}
Quick Fixes (Try These First)
1. Restart Pod
kubectl rollout restart deployment/{engine_name} -n {namespace}
# Wait for rollout
kubectl rollout status deployment/{engine_name} -n {namespace}
2. Force Pull Latest Image
# Delete pod to force a fresh pull (re-pulls only if imagePullPolicy is Always or the image is not cached on the node)
kubectl delete pod -n {namespace} {pod_name}
# Or set the image reference explicitly
kubectl set image deployment/{engine_name} \
  {container_name}=ghcr.io/{org}/{image}:{tag} \
  --record \
  -n {namespace}
3. Scale Down and Up
# Scale to 0
kubectl scale deployment/{engine_name} --replicas=0 -n {namespace}
# Wait 10 seconds
sleep 10
# Scale back up
kubectl scale deployment/{engine_name} --replicas=1 -n {namespace}
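Because the response window is short, it can help to wrap the restart from Quick Fix 1 so that a stuck rollout fails fast instead of waiting indefinitely. The sketch below uses the `--timeout` flag of `kubectl rollout status`; the function name `restart_and_wait` and the 120-second default are assumptions, not part of the original procedure.

```shell
# restart_and_wait restarts a deployment and waits for the rollout.
# Arguments: namespace, deployment name, optional timeout (default 120s).
restart_and_wait() {
  ns="$1"; deploy="$2"; timeout="${3:-120s}"
  kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
  # rollout status exits non-zero if the rollout does not finish in time
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not complete within $timeout" >&2
    return 1
  fi
}
```

A call such as `restart_and_wait {namespace} {engine_name} 90s` returns non-zero when the rollout stalls, which is a reasonable signal to move on to the state-specific sections above.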
Prevention
Before Deploying
- Test Locally
  docker run -p 8000:8000 ghcr.io/{org}/{image}:{tag}
  curl http://localhost:8000/.well-known/engine-status
- Check Resource Limits
  - Ensure limits match application needs
  - Test with load
- Verify Image Exists
  docker pull ghcr.io/{org}/{image}:{tag}
Monitoring
- Set Up Alerts
  - Configure Uptime Kuma monitors for each engine
  - Alert in Slack #alerts-critical when engine down >5 minutes
  - Alert on high restart count
  - Monitor via PostHog analytics for activity drops
- Regular Health Checks
  - Monitor Control Center dashboard daily
  - Review Uptime Kuma status page
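A restart-count alert needs a number to compare against a threshold. One way to read it is from the pod status, as in the sketch below. The helper names are hypothetical, only the first container is inspected, and the default threshold of 5 is an arbitrary example rather than a value taken from this runbook.

```shell
# restart_count prints the restart counter of the first container in a pod.
# Arguments: namespace, pod name.
restart_count() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# restarts_exceed succeeds when the count is above the threshold.
# Arguments: namespace, pod name, optional threshold (default 5).
# Returns 2 if the count could not be read.
restarts_exceed() {
  count=$(restart_count "$1" "$2") || return 2
  [ "${count:-0}" -gt "${3:-5}" ]
}
```

A periodic job could call `restarts_exceed {namespace} {pod_name} 5` and raise an alert whenever it succeeds.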
Escalation
If issue persists after 30 minutes:
- Check Cluster Health
  kubectl get nodes
  kubectl top nodes
- Review Recent Changes
  - Check GitHub commits
  - Check Argo CD sync history
- Contact Support
  - Post in Slack #alerts-critical with details
  - Gather diagnostic bundle:
    kubectl logs -n {namespace} deployment/{engine_name} > logs.txt
    kubectl describe pod -n {namespace} {pod_name} > describe.txt
    kubectl get events -n {namespace} --sort-by='.lastTimestamp' > events.txt
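The three diagnostic files are easier to attach to an escalation when they are collected into one directory. A possible wrapper is sketched here; `collect_diagnostics` and the `diag-` directory naming are assumptions made for illustration.

```shell
# collect_diagnostics writes the runbook's three diagnostic files into a
# timestamped directory and prints the directory name.
# Arguments: namespace, engine (deployment) name, pod name.
collect_diagnostics() {
  ns="$1"; engine="$2"; pod="$3"
  dir="diag-${engine}-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dir" || return 1
  # Capture stderr as well, so kubectl errors end up in the bundle.
  kubectl logs -n "$ns" "deployment/$engine" > "$dir/logs.txt" 2>&1
  kubectl describe pod -n "$ns" "$pod" > "$dir/describe.txt" 2>&1
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$dir/events.txt" 2>&1
  echo "$dir"
}
```

Running `collect_diagnostics {namespace} {engine_name} {pod_name}` leaves a single directory that can be archived and shared in the Slack thread.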
Post-Incident
1. Document Resolution
- What caused the issue?
- What fixed it?
- How long was it down?
- Post summary in Slack #alerts-critical
2. Update Monitoring
- Add Uptime Kuma monitor if not caught
- Adjust alert thresholds if needed
- Configure CrowdSec rules if security-related
3. Prevent Recurrence
- Fix root cause
- Add automated checks
- Update this runbook
Checklist
- Verified pod status
- Checked pod logs
- Attempted restart
- Verified fix (engine shows green in Uptime Kuma & Control Center)
- Updated incident log
- Posted resolution in Slack
- Reviewed prevention steps
Last Updated: 2026-03-25
Version: 1.1