Hello Engine Operations Runbook

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Status: In Development
Owner: Platform Team

This runbook covers operational procedures for the hello-engine service, a lightweight demonstration engine used for testing and client onboarding.

Quick Links

Service manifest: engines/client-sample.hello-engine.yaml
Deployment chart: charts/hello-engine/
Uptime Kuma: Hello Engine monitors
PostHog: Hello Engine analytics dashboard
Slack: #alerts-warning, #alerts-critical

Service Overview

Hello Engine provides:

Simple health check endpoints
Demo API responses
Client onboarding testing
Platform health verification

Expected Usage:

Development and staging environments
Client demo environments
Platform integration testing
Smoke testing new deployments

Common Operations

Check Service Status

# Check pod status
kubectl get pods -n {namespace} -l app=hello-engine

# Check logs
kubectl logs -n {namespace} deployment/hello-engine --tail=100

# Follow logs in real time
kubectl logs -f -n {namespace} deployment/hello-engine

# Check health endpoint (port-forward blocks; run curl from a second terminal)
kubectl port-forward -n {namespace} svc/hello-engine 8080:80
curl http://localhost:8080/.well-known/engine-status
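Because the port-forward holds the terminal, the health check above needs two shells when run by hand. A minimal sketch of a one-shot probe follows; `check_engine_status` is a hypothetical helper name, and the two-second pause and five-second request timeout are assumptions rather than values from the service.

```shell
# Sketch only: forward a local port, probe the status endpoint once, clean up.
# check_engine_status is a hypothetical helper, not part of the hello-engine tooling.
check_engine_status() {
  local namespace="$1"
  local local_port="${2:-8080}"
  local rc=0
  local pf_pid

  # Run the port-forward in the background so this shell can issue the request.
  kubectl port-forward -n "$namespace" svc/hello-engine "${local_port}:80" >/dev/null 2>&1 &
  pf_pid=$!

  # Fixed pause while the tunnel comes up (an assumption; tune as needed).
  sleep 2

  # -f turns HTTP errors (status >= 400) into a non-zero exit code.
  curl -fsS --max-time 5 "http://localhost:${local_port}/.well-known/engine-status" || rc=$?

  # Stop the port-forward whether or not the probe succeeded.
  kill "$pf_pid" 2>/dev/null || true
  wait "$pf_pid" 2>/dev/null || true
  return "$rc"
}

# Usage: check_engine_status client-demo && echo "healthy"
```

The function returns curl's exit code, so it can gate later steps in a script.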

Scale Replicas

# Scale up for high traffic
kubectl scale deployment hello-engine --replicas=3 -n {namespace}

# Scale down for maintenance
kubectl scale deployment hello-engine --replicas=1 -n {namespace}

# Check scaling status
kubectl get deployment hello-engine -n {namespace}
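The scale and status commands above can be combined so the caller only returns once the new replica count is available. This is a sketch under assumptions: `scale_and_wait` is a hypothetical helper and the 120-second timeout is arbitrary. If a HorizontalPodAutoscaler targets the deployment, it may change the replica count again afterwards.

```shell
# Sketch only: scale hello-engine and wait for the rollout to settle.
# scale_and_wait is a hypothetical helper name.
scale_and_wait() {
  local namespace="$1"
  local replicas="$2"

  # Reject anything that is not a non-negative integer before touching the cluster.
  case "$replicas" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer: '$replicas'" >&2
      return 2
      ;;
  esac

  kubectl scale deployment hello-engine --replicas="$replicas" -n "$namespace" || return 1

  # Blocks until the requested replicas are available or the timeout expires.
  kubectl rollout status deployment/hello-engine -n "$namespace" --timeout=120s
}

# Usage: scale_and_wait client-demo 3
```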

Update Configuration

# Edit deployment
kubectl edit deployment hello-engine -n {namespace}

# Or update via Helm
helm upgrade hello-engine ./charts/hello-engine \
  --set config.featureX=true \
  --namespace {namespace}

# Restart to apply changes
kubectl rollout restart deployment/hello-engine -n {namespace}

Monitoring Configuration

Uptime Kuma

Monitors to Configure:

Health Check Monitor
- Type: HTTP(s)
- Name: Hello Engine - {namespace}
- URL: http://hello-engine.{namespace}.svc.cluster.local/.well-known/engine-status
- Interval: 60 seconds
- Alert: Slack #alerts-warning after 2 failures

Response Time Monitor
- Type: HTTP(s)
- Name: Hello Engine - Response Time
- URL: Same as above
- Alert: Slack #alerts-warning if >1000ms
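To reproduce by hand what the two monitors check, the sketch below probes a URL once and compares the response time with the 1000 ms threshold. `probe_latency` and `seconds_to_ms` are hypothetical helper names and the ten-second timeout is an assumption. The cluster-local URL resolves only inside the cluster, so from a workstation the probe would go through a port-forward instead.

```shell
# Sketch only: a manual equivalent of the health and response-time monitors.

# Convert curl's time_total (seconds, e.g. "0.045123") to whole milliseconds.
seconds_to_ms() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

# Prints OK, SLOW, or DOWN; returns 0, 1, or 2 respectively.
probe_latency() {
  local url="$1"
  local threshold_ms="${2:-1000}"
  local t ms

  # -w '%{time_total}' prints the total request time; the body is discarded.
  t=$(curl -fsS -o /dev/null --max-time 10 -w '%{time_total}' "$url") || {
    echo "DOWN"
    return 2
  }

  ms=$(seconds_to_ms "$t")
  if [ "$ms" -gt "$threshold_ms" ]; then
    echo "SLOW ${ms}ms"
    return 1
  fi
  echo "OK ${ms}ms"
}

# Usage: probe_latency http://localhost:8080/.well-known/engine-status 1000
```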

PostHog Analytics

Create Dashboard: "Hello Engine Analytics"

Track Events:

Health checks performed
API requests by endpoint
Response times
Error rates
Client demo sessions

Useful Insights:

// Track health check
posthog.capture('hello_engine_health_check', {
  status: 'ok',
  response_time_ms: 45,
  namespace: 'client-demo'
})

// Track demo session
posthog.capture('hello_engine_demo_started', {
  client: 'acme-corp',
  user: 'demo-user'
})

Slack Alerts

Configure alert channels:

Service down → #alerts-critical
High latency → #alerts-warning
Demo session started → #sales-demo
Error rate spike → #alerts-warning

Autoscaling Configuration

Horizontal Pod Autoscaler

# Save as: k8s/hello-engine-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-engine
  namespace: {namespace}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-engine
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Apply:

kubectl apply -f k8s/hello-engine-hpa.yaml

Autoscaling Thresholds:

Min replicas: 1
Max replicas: 5
CPU target: 70%
Memory target: 80%
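Scale up: When threshold exceeded for 2 minutes (not set by the manifest above, which has no behavior block; by default the HPA scales up without a stabilization delay)
Scale down: When below threshold for 5 minutes (matches the default HPA scale-down stabilization window of 300 seconds)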
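For a rough sense of how these targets translate into replica counts, the sketch below applies the documented HPA rule, desired = ceil(current replicas × current utilization / target utilization), clamped to the min and max above. `hpa_desired_replicas` is a hypothetical helper. The sketch ignores the controller's tolerance band and stabilization windows, and with both CPU and memory configured the controller uses the larger of the two results.

```shell
# Sketch only: approximate the replica count the HPA would request for one metric.
# Arguments: current replicas, current utilization %, target utilization %,
# then optional min and max replicas (defaults taken from the manifest: 1 and 5).
hpa_desired_replicas() {
  local current_replicas="$1"
  local current_util="$2"
  local target_util="$3"
  local min_replicas="${4:-1}"
  local max_replicas="${5:-5}"
  local desired

  # Integer ceiling of current_replicas * current_util / target_util.
  desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))

  # Clamp to the configured bounds.
  if [ "$desired" -lt "$min_replicas" ]; then desired="$min_replicas"; fi
  if [ "$desired" -gt "$max_replicas" ]; then desired="$max_replicas"; fi
  echo "$desired"
}

# Usage: two replicas averaging 90% CPU against the 70% target
# hpa_desired_replicas 2 90 70
```

Running the CPU and memory figures through the helper separately and taking the larger value approximates the multi-metric case.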

Deployment Procedures

Deploy New Version

# 1. Build and push image
docker build -t ghcr.io/egintegrations/hello-engine:v1.2.0 .
docker push ghcr.io/egintegrations/hello-engine:v1.2.0

# 2. Update Helm values
helm upgrade hello-engine ./charts/hello-engine \
  --set image.tag=v1.2.0 \
  --namespace {namespace} \
  --wait

# 3. Verify rollout
kubectl rollout status deployment/hello-engine -n {namespace}

# 4. Test health endpoint (port-forward blocks; run curl from a second terminal)
kubectl port-forward -n {namespace} svc/hello-engine 8080:80
curl http://localhost:8080/.well-known/engine-status

# 5. Post in Slack #alerts-info
# "Hello Engine updated to v1.2.0 in {namespace}"
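The first health probe after a rollout can fail while new pods are still warming up, so step 4 may need repeating. A minimal retry wrapper is sketched below; `retry` is a hypothetical helper, and the attempt count and delay in the usage line are illustrative values, not recommendations from the service owners.

```shell
# Sketch only: run a command until it succeeds or the attempts run out.
# Arguments: max attempts, delay in seconds between attempts, then the command.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local n=1

  while :; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is spent.
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage, with the port-forward from step 4 already running:
# retry 5 3 curl -fsS http://localhost:8080/.well-known/engine-status
```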

Rollback Deployment

# View rollout history
kubectl rollout history deployment/hello-engine -n {namespace}

# Rollback to previous version
kubectl rollout undo deployment/hello-engine -n {namespace}

# Or rollback to specific revision
kubectl rollout undo deployment/hello-engine --to-revision=3 -n {namespace}

# Verify rollback
kubectl rollout status deployment/hello-engine -n {namespace}

# Post in Slack #alerts-critical
# "Hello Engine rolled back in {namespace} - reason: [description]"

Troubleshooting

Service Not Responding

Investigation:

# Check pod status
kubectl get pods -n {namespace} -l app=hello-engine

# Check events
kubectl get events -n {namespace} --sort-by=.lastTimestamp

# Check logs for errors
kubectl logs -n {namespace} deployment/hello-engine --tail=100 | grep ERROR

# Check resource usage
kubectl top pod -n {namespace} -l app=hello-engine

Common Issues:

Pod stuck in CrashLoopBackOff → See runbook: 01-engine-down.md
High memory usage → Increase resource limits
Network issues → Check service and ingress configuration
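When several pods are listed, a small filter over the pod listing can surface the ones worth investigating first. The sketch below assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE); `flag_unhealthy_pods` is a hypothetical helper and the restart threshold of 3 is an arbitrary default.

```shell
# Sketch only: read `kubectl get pods --no-headers` output on stdin and print
# pods that are not Running or that have restarted more than the threshold.
flag_unhealthy_pods() {
  awk -v max="${1:-3}" '
    $3 != "Running" || ($4 + 0) > (max + 0) {
      print $1 ": status=" $3 ", restarts=" $4
    }
  '
}

# Usage:
# kubectl get pods -n client-demo -l app=hello-engine --no-headers | flag_unhealthy_pods 3
```

A pod reported in CrashLoopBackOff by this filter is the case covered by 01-engine-down.md.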

High Latency

Investigation:

# Check response times via PostHog dashboard
# Check pod resources
kubectl top pod -n {namespace} -l app=hello-engine

# Check CPU throttling
kubectl describe pod -n {namespace} {pod-name} | grep -i throttl

# Check network latency (4 packets; requires ping in the container image)
kubectl exec -it -n {namespace} deployment/hello-engine -- ping -c 4 8.8.8.8

Fixes:

Increase CPU limits
Scale up replicas
Enable autoscaling
Check database connection pool

Feature Toggle Not Working

Investigation:

# Check environment variables
kubectl get deployment hello-engine -n {namespace} -o yaml | grep -A 10 env:

# Check config map
kubectl get configmap hello-engine-config -n {namespace} -o yaml

# Test feature toggle endpoint (requires an active port-forward to svc/hello-engine on 8080)
curl http://localhost:8080/api/features

Fixes:

Update ConfigMap
Restart deployment
Verify feature flag in code

Feature Toggle Mapping

Document feature toggles for hello-engine:

| Feature | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| Demo Mode | `DEMO_MODE_ENABLED` | `true` | Enable demo responses |
| Analytics | `ANALYTICS_ENABLED` | `true` | Send events to PostHog |
| Debug Logging | `DEBUG_LOGGING` | `false` | Verbose logs |
| Health Check | `HEALTH_CHECK_ENABLED` | `true` | Enable health endpoint |
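To see which toggles in a running deployment differ from the defaults in this table, the sketch below filters `KEY=value` lines such as those printed by `kubectl set env deployment/hello-engine --list`. `diff_toggles_from_defaults` is a hypothetical helper. Variables injected by reference from a ConfigMap are listed as comment lines rather than `KEY=value` pairs, so they would be skipped here.

```shell
# Sketch only: report feature toggles whose value differs from the documented default.
# Reads KEY=value lines on stdin; comment lines and unknown keys are ignored.
diff_toggles_from_defaults() {
  awk -F= '
    BEGIN {
      # Defaults copied from the feature toggle table.
      def["DEMO_MODE_ENABLED"] = "true"
      def["ANALYTICS_ENABLED"] = "true"
      def["DEBUG_LOGGING"] = "false"
      def["HEALTH_CHECK_ENABLED"] = "true"
    }
    /^#/ { next }
    ($1 in def) && $2 != def[$1] {
      print $1 ": " $2 " (default " def[$1] ")"
    }
  '
}

# Usage:
# kubectl set env deployment/hello-engine -n client-demo --list | diff_toggles_from_defaults
```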

Update Feature Toggles

# Via Helm
helm upgrade hello-engine ./charts/hello-engine \
  --set env.DEMO_MODE_ENABLED=false \
  --namespace {namespace}

# Or edit ConfigMap
kubectl edit configmap hello-engine-config -n {namespace}

# Restart to apply
kubectl rollout restart deployment/hello-engine -n {namespace}

Maintenance Checklist

Daily:

Check Uptime Kuma status
Review error logs for anomalies
Verify PostHog receiving events

Weekly:

Review response time trends in PostHog
Check resource utilization
Test demo endpoints
Review feature toggle usage

Monthly:

Update hello-engine to latest version
Audit feature toggles
Review autoscaling configuration
Test disaster recovery procedure
Update documentation

Security Monitoring with CrowdSec

Configure CrowdSec Rules

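# CrowdSec scenario: Hello Engine abuse
type: leaky
name: egi/hello-engine-abuse
description: "Detect excessive hello-engine requests"
filter: "evt.Meta.service == 'hello-engine'"
# One bucket per source IP, so a single noisy client overflows its own bucket
groupby: evt.Meta.source_ip
leakspeed: "10s"
capacity: 100
labels:
  service: hello-engine
  # The ban itself is applied by the CrowdSec profile that matches this alert
  remediation: true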
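To get a feel for what a capacity of 100 with a 10-second leak speed tolerates, the sketch below runs a simplified leaky-bucket model over a list of event timestamps. It is an approximation for reasoning about thresholds, not CrowdSec's implementation. `simulate_leaky_bucket` is a hypothetical helper, and the model assumes the bucket drains one event per leak interval and overflows once it holds more than `capacity` events.

```shell
# Sketch only: simplified leaky-bucket model.
# Reads one event timestamp (seconds, ascending) per line on stdin.
# Arguments: capacity (default 100), leak interval in seconds (default 10).
simulate_leaky_bucket() {
  awk -v capacity="${1:-100}" -v leak="${2:-10}" '
    {
      t = $1 + 0
      if (NR > 1) {
        # Drain the bucket for the time elapsed since the previous event.
        level -= (t - last) / leak
        if (level < 0) level = 0
      }
      level += 1
      last = t
      if (level > capacity) {
        print "overflow at t=" t
        overflowed = 1
        exit
      }
    }
    END { if (!overflowed) print "no overflow" }
  '
}

# Usage: a health check every 60 seconds for an hour
# awk 'BEGIN { for (i = 0; i < 60; i++) print i * 60 }' | simulate_leaky_bucket 100 10
```

Under this model a 60-second monitor interval never fills the bucket, while a burst of more than 100 requests within the same instant overflows it.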

Monitor For:

Excessive health check requests
API endpoint abuse
Failed authentication attempts
Suspicious traffic patterns

Actions:

Auto-ban IPs exceeding rate limits
Alert in Slack #security-alerts
Log security events for review

Production Readiness Checklist

Before promoting hello-engine to production:

Future Documentation Tasks

This runbook will be expanded to include:

Document expected replicas for each environment
Capture detailed restart/rollback procedures
Record all feature toggle mappings and fallbacks
Add client-specific customizations
Document integration points
Add performance tuning guidelines
Create troubleshooting decision tree
Add load testing procedures

Last Updated: 2026-03-25
Version: 1.0
Status: Draft - Update before production deployment
