Hello Engine Operations Runbook

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Status: In Development
Owner: Platform Team

This runbook covers operational procedures for the hello-engine service, a lightweight demonstration engine used for testing and client onboarding.


  • Service manifest: engines/client-sample.hello-engine.yaml
  • Deployment chart: charts/hello-engine/
  • Uptime Kuma: Hello Engine monitors
  • PostHog: Hello Engine analytics dashboard
  • Slack: #alerts-warning, #alerts-critical

Service Overview

Hello Engine provides:

  • Simple health check endpoints
  • Demo API responses
  • Client onboarding testing
  • Platform health verification

Expected Usage:

  • Development and staging environments
  • Client demo environments
  • Platform integration testing
  • Smoke testing new deployments

Common Operations

Check Service Status

# Check pod status
kubectl get pods -n {namespace} -l app=hello-engine

# Check logs
kubectl logs -n {namespace} deployment/hello-engine --tail=100

# Follow logs in real-time
kubectl logs -f -n {namespace} deployment/hello-engine

# Check health endpoint (port-forward blocks, so run it in the background
# or in a separate terminal)
kubectl port-forward -n {namespace} svc/hello-engine 8080:80 &
sleep 2  # give the tunnel a moment to open
curl http://localhost:8080/.well-known/engine-status
kill $!  # stop the port-forward

Scale Replicas

# Scale up for high traffic
kubectl scale deployment hello-engine --replicas=3 -n {namespace}

# Scale down for maintenance
kubectl scale deployment hello-engine --replicas=1 -n {namespace}

# Check scaling status
kubectl get deployment hello-engine -n {namespace}

Update Configuration

# Edit deployment
kubectl edit deployment hello-engine -n {namespace}

# Or update via Helm
helm upgrade hello-engine ./charts/hello-engine \
  --set config.featureX=true \
  --namespace {namespace}

# Restart to apply changes
kubectl rollout restart deployment/hello-engine -n {namespace}

Monitoring Configuration

Uptime Kuma

Monitors to Configure:

  1. Health Check Monitor

    • Type: HTTP(s)
    • Name: Hello Engine - {namespace}
    • URL: http://hello-engine.{namespace}.svc.cluster.local/.well-known/engine-status
    • Interval: 60 seconds
    • Alert: Slack #alerts-warning after 2 failures
  2. Response Time Monitor

    • Type: HTTP(s)
    • Name: Hello Engine - Response Time
    • URL: Same as above
    • Alert: Slack #alerts-warning if >1000ms
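
The two monitors above can be approximated by hand when Uptime Kuma itself is in doubt. The following is a minimal sketch, not the monitor's actual logic: `classify_probe` is a hypothetical helper name, and treating any 2xx status as healthy is an assumption. It takes the HTTP status code and total request time that curl reports and applies the 1000 ms latency threshold from the Response Time Monitor.

```shell
# classify_probe HTTP_CODE TIME_SECONDS [THRESHOLD_MS]
# Prints "down", "slow", or "ok". Hypothetical helper for manual checks.
classify_probe() {
  local code=$1 seconds=$2 threshold_ms=${3:-1000}

  case "$code" in
    2??) ;;                        # assumption: any 2xx counts as healthy
    *) echo "down"; return 0 ;;    # includes 000, which curl prints on connection failure
  esac

  # curl reports time_total in fractional seconds; compare in milliseconds with awk
  if awk -v s="$seconds" -v t="$threshold_ms" 'BEGIN { exit !(s * 1000 > t) }'; then
    echo "slow"
  else
    echo "ok"
  fi
}

# Example (needs cluster DNS or an active port-forward):
#   URL=http://localhost:8080/.well-known/engine-status
#   classify_probe $(curl -s -o /dev/null -w '%{http_code} %{time_total}' "$URL")
```

The command substitution in the example is deliberately unquoted so that the status code and the timing become two separate arguments.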

PostHog Analytics

Create Dashboard: "Hello Engine Analytics"

Track Events:

  • Health checks performed
  • API requests by endpoint
  • Response times
  • Error rates
  • Client demo sessions

Useful Insights:

// Track health check
posthog.capture('hello_engine_health_check', {
  status: 'ok',
  response_time_ms: 45,
  namespace: 'client-demo'
})

// Track demo session
posthog.capture('hello_engine_demo_started', {
  client: 'acme-corp',
  user: 'demo-user'
})

Slack Alerts

Configure alert channels:

  • Service down → #alerts-critical
  • High latency → #alerts-warning
  • Demo session started → #sales-demo
  • Error rate spike → #alerts-warning
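
If alerts are sent from scripts rather than configured in each tool, the routing above could be kept in one place. This is a sketch only: the event keys (`service_down`, `high_latency`, and so on) are hypothetical names invented for illustration, and the fallback channel for unknown events is an assumption.

```shell
# alert_channel EVENT
# Prints the Slack channel for an event type, following the routing list above.
alert_channel() {
  case "$1" in
    service_down) echo "#alerts-critical" ;;
    high_latency) echo "#alerts-warning" ;;
    demo_started) echo "#sales-demo" ;;
    error_spike)  echo "#alerts-warning" ;;
    *)            echo "#alerts-warning" ;;  # assumption: unknown events go to warning
  esac
}
```

For example, `alert_channel service_down` prints `#alerts-critical`.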

Autoscaling Configuration

Horizontal Pod Autoscaler

# Save as: k8s/hello-engine-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-engine
  namespace: {namespace}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-engine
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Apply:

kubectl apply -f k8s/hello-engine-hpa.yaml

Autoscaling Thresholds:

  • Min replicas: 1
  • Max replicas: 5
  • CPU target: 70%
  • Memory target: 80%
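  • Scale up: When threshold exceeded for 2 minutes (the manifest above sets no behavior block, so this delay would need something like behavior.scaleUp.stabilizationWindowSeconds: 120; by default the HPA scales up without a stabilization delay)
  • Scale down: When below threshold for 5 minutes (matches the HPA default scale-down stabilization window of 300 seconds)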
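
To sanity-check these thresholds, the replica count the HPA would request can be estimated from the formula in the Kubernetes documentation: desired = ceil(current replicas × current utilization / target utilization), clamped to the min and max. The sketch below uses integer arithmetic and ignores the HPA's tolerance band and stabilization windows; `desired_replicas` is a hypothetical helper name, and the default bounds of 1 and 5 are taken from the list above.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT [MIN] [MAX]
# Prints the estimated replica count for a single metric.
desired_replicas() {
  local current=$1 util=$2 target=$3
  local min=${4:-1} max=${5:-5}
  local desired

  # Integer ceiling of (current * util / target)
  desired=$(( (current * util + target - 1) / target ))

  # Clamp to the configured bounds
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi

  echo "$desired"
}
```

For example, `desired_replicas 2 140 70` prints `4`: two pods at 140% of requested CPU against a 70% target. When both CPU and memory metrics are configured, the HPA evaluates each metric separately and uses the larger result.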

Deployment Procedures

Deploy New Version

# 1. Build and push image
docker build -t ghcr.io/egintegrations/hello-engine:v1.2.0 .
docker push ghcr.io/egintegrations/hello-engine:v1.2.0

# 2. Update Helm values
helm upgrade hello-engine ./charts/hello-engine \
  --set image.tag=v1.2.0 \
  --namespace {namespace} \
  --wait

# 3. Verify rollout
kubectl rollout status deployment/hello-engine -n {namespace}

# 4. Test health endpoint (port-forward blocks, so background it)
kubectl port-forward -n {namespace} svc/hello-engine 8080:80 &
sleep 2  # give the tunnel a moment to open
curl http://localhost:8080/.well-known/engine-status
kill $!  # stop the port-forward

# 5. Post in Slack #alerts-info
# "Hello Engine updated to v1.2.0 in {namespace}"
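
The health check in step 4 can fail briefly while new pods warm up, so a small retry wrapper may be more reliable than a single curl. This is a sketch under that assumption; `retry` is a hypothetical helper, and the attempt count and delay in the example are illustrative values, not settings from this runbook.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2

  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
}

# Example (with the port-forward from step 4 active):
#   retry 5 2 curl -fsS http://localhost:8080/.well-known/engine-status
```

In the example, `curl -f` makes HTTP error responses count as failures, so the wrapper keeps retrying until the endpoint answers successfully or the attempts run out.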

Rollback Deployment

# View rollout history
kubectl rollout history deployment/hello-engine -n {namespace}

# Rollback to previous version
kubectl rollout undo deployment/hello-engine -n {namespace}

# Or rollback to specific revision
kubectl rollout undo deployment/hello-engine --to-revision=3 -n {namespace}

# Verify rollback
kubectl rollout status deployment/hello-engine -n {namespace}

# Post in Slack #alerts-critical
# "Hello Engine rolled back in {namespace} - reason: [description]"

Troubleshooting

Service Not Responding

Investigation:

# Check pod status
kubectl get pods -n {namespace} -l app=hello-engine

# Check events
kubectl get events -n {namespace} --sort-by=.lastTimestamp

# Check logs for errors
kubectl logs -n {namespace} deployment/hello-engine --tail=100 | grep ERROR

# Check resource usage
kubectl top pod -n {namespace} -l app=hello-engine

Common Issues:

  • Pod stuck in CrashLoopBackOff → See runbook: 01-engine-down.md
  • High memory usage → Increase resource limits
  • Network issues → Check service and ingress configuration

High Latency

Investigation:

# Check response times via PostHog dashboard
# Check pod resources
kubectl top pod -n {namespace} -l app=hello-engine

# Check CPU throttling
kubectl describe pod -n {namespace} {pod-name} | grep -i throttl

# Check network latency
kubectl exec -it -n {namespace} deployment/hello-engine -- ping 8.8.8.8

Fixes:

  1. Increase CPU limits
  2. Scale up replicas
  3. Enable autoscaling
  4. Check database connection pool

Feature Toggle Not Working

Investigation:

# Check environment variables
kubectl get deployment hello-engine -n {namespace} -o yaml | grep -A 10 env:

# Check config map
kubectl get configmap hello-engine-config -n {namespace} -o yaml

# Test feature toggle endpoint
# (requires an active port-forward to svc/hello-engine on local port 8080)
curl http://localhost:8080/api/features

Fixes:

  1. Update ConfigMap
  2. Restart deployment
  3. Verify feature flag in code

Feature Toggle Mapping

Document feature toggles for hello-engine:

Feature       | Environment Variable | Default | Description
Demo Mode     | DEMO_MODE_ENABLED    | true    | Enable demo responses
Analytics     | ANALYTICS_ENABLED    | true    | Send events to PostHog
Debug Logging | DEBUG_LOGGING        | false   | Verbose logs
Health Check  | HEALTH_CHECK_ENABLED | true    | Enable health endpoint
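
When debugging a toggle, it can help to see which value would take effect once the documented defaults are applied. The sketch below encodes the defaults from the table and falls back to them when a variable is unset or empty. `toggle_default` and `toggle_value` are hypothetical helpers, and whether the engine itself resolves toggles this way is an assumption.

```shell
# toggle_default NAME
# Prints the documented default for a known toggle; fails for unknown names.
toggle_default() {
  case "$1" in
    DEMO_MODE_ENABLED|ANALYTICS_ENABLED|HEALTH_CHECK_ENABLED) echo true ;;
    DEBUG_LOGGING) echo false ;;
    *) return 1 ;;
  esac
}

# toggle_value NAME
# Prints the current value of the variable, or its default if unset or empty.
toggle_value() {
  local name=$1 default
  default=$(toggle_default "$name") || return 1
  # Indirect lookup by variable name; safe because toggle_default
  # has already restricted NAME to the four known toggles.
  eval "printf '%s\n' \"\${$name:-$default}\""
}
```

For example, with `DEBUG_LOGGING` unset, `toggle_value DEBUG_LOGGING` prints `false`. After `DEMO_MODE_ENABLED=false`, `toggle_value DEMO_MODE_ENABLED` prints `false` instead of the default `true`.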

Update Feature Toggles

# Via Helm
helm upgrade hello-engine ./charts/hello-engine \
  --set env.DEMO_MODE_ENABLED=false \
  --namespace {namespace}

# Or edit ConfigMap
kubectl edit configmap hello-engine-config -n {namespace}

# Restart to apply
kubectl rollout restart deployment/hello-engine -n {namespace}

Maintenance Checklist

Daily:

  • Check Uptime Kuma status
  • Review error logs for anomalies
  • Verify PostHog receiving events

Weekly:

  • Review response time trends in PostHog
  • Check resource utilization
  • Test demo endpoints
  • Review feature toggle usage

Monthly:

  • Update hello-engine to latest version
  • Audit feature toggles
  • Review autoscaling configuration
  • Test disaster recovery procedure
  • Update documentation

Security Monitoring with CrowdSec

Configure CrowdSec Rules

# CrowdSec scenario: Hello Engine abuse
type: leaky
name: egi/hello-engine-abuse
description: "Detect excessive hello-engine requests"
filter: "evt.Meta.service == 'hello-engine'"
leakspeed: "10s"
capacity: 100
labels:
  service: hello-engine
  remediation: ban
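
To reason about when this scenario would trigger, the leaky-bucket idea can be simulated offline. The sketch below is a simplified model, not CrowdSec's implementation: each event adds one unit to the bucket, one unit leaks every `leakspeed` seconds, and the bucket overflows once it holds more than `capacity` units. `bucket_overflows` is a hypothetical helper; timestamps are whole seconds in ascending order.

```shell
# bucket_overflows CAPACITY LEAK_SECONDS TIMESTAMP...
# Prints "overflow at t=<ts>" for the first overflowing event, else "no overflow".
bucket_overflows() {
  local capacity=$1 leak=$2
  shift 2
  local level=0 last=0 t leaked

  for t in "$@"; do
    if [ "$level" -gt 0 ]; then
      # Whole units leaked since the last leak tick
      leaked=$(( (t - last) / leak ))
      if [ "$leaked" -ge "$level" ]; then
        level=0
      else
        level=$(( level - leaked ))
        last=$(( last + leaked * leak ))
      fi
    fi
    # An empty bucket starts its leak clock at the incoming event
    if [ "$level" -eq 0 ]; then last=$t; fi

    level=$(( level + 1 ))
    if [ "$level" -gt "$capacity" ]; then
      echo "overflow at t=$t"
      return 0
    fi
  done
  echo "no overflow"
}
```

For example, `bucket_overflows 3 10 0 1 2 3` prints `overflow at t=3`, while `bucket_overflows 3 10 0 10 20 30 40` prints `no overflow` because each event leaks out before the next arrives. Under this model, the scenario's capacity of 100 and leakspeed of 10s would overflow on a burst of 101 requests within 10 seconds.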

Monitor For:

  • Excessive health check requests
  • API endpoint abuse
  • Failed authentication attempts
  • Suspicious traffic patterns

Actions:

  • Auto-ban IPs exceeding rate limits
  • Alert in Slack #security-alerts
  • Log security events for review

Production Readiness Checklist

Before promoting hello-engine to production:

  • Autoscaling configured and tested
  • Resource limits properly set
  • Health checks configured in Uptime Kuma
  • PostHog analytics tracking all events
  • Slack alerts configured for all severity levels
  • CrowdSec security rules active
  • Backup and rollback procedures tested
  • Documentation complete and reviewed
  • On-call escalation path defined
  • Client-specific configurations documented

Future Documentation Tasks

Planned additions to this runbook:

  • Document expected replicas for each environment
  • Capture detailed restart/rollback procedures
  • Record all feature toggle mappings and fallbacks
  • Add client-specific customizations
  • Document integration points
  • Add performance tuning guidelines
  • Create troubleshooting decision tree
  • Add load testing procedures

Last Updated: 2026-03-25
Version: 1.0
Status: Draft - Update before production deployment