Hello Engine Operations Runbook
This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.
Status: In Development
Owner: Platform Team
This runbook covers operational procedures for the hello-engine service, a lightweight demonstration engine used for testing and client onboarding.
Quick Links
- Service manifest: engines/client-sample.hello-engine.yaml
- Deployment chart: charts/hello-engine/
- Uptime Kuma: Hello Engine monitors
- PostHog: Hello Engine analytics dashboard
- Slack: #alerts-warning, #alerts-critical
Service Overview
Hello Engine provides:
- Simple health check endpoints
- Demo API responses
- Client onboarding testing
- Platform health verification
Expected Usage:
- Development and staging environments
- Client demo environments
- Platform integration testing
- Smoke testing new deployments
Common Operations
Check Service Status
# Check pod status
kubectl get pods -n {namespace} -l app=hello-engine
# Check logs
kubectl logs -n {namespace} deployment/hello-engine --tail=100
# Follow logs in real-time
kubectl logs -f -n {namespace} deployment/hello-engine
# Check health endpoint
# (port-forward stays in the foreground; run it in a separate terminal, then curl from another)
kubectl port-forward -n {namespace} svc/hello-engine 8080:80
curl http://localhost:8080/.well-known/engine-status
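For scripted smoke tests, the manual health check can be wrapped in a small shell function. This is a minimal sketch, assuming the endpoint answers HTTP 200 when the service is healthy (this runbook does not document the response body) and that a port-forward is already listening on the given port. `check_engine_status` is a hypothetical helper name, not something shipped with the service.

```shell
# Hypothetical helper: succeeds (exit 0) only if the status endpoint returns HTTP 200.
# Assumes a port-forward to svc/hello-engine is already listening on the given port.
check_engine_status() {
  port="${1:-8080}"
  # -s: silent, -o /dev/null: discard the body, -w: print only the HTTP status code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    "http://localhost:${port}/.well-known/engine-status")" || code="000"
  [ "$code" = "200" ]
}
```

Example use: `check_engine_status 8080 && echo "hello-engine healthy"`.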
Scale Replicas
# Scale up for high traffic
kubectl scale deployment hello-engine --replicas=3 -n {namespace}
# Scale down for maintenance
kubectl scale deployment hello-engine --replicas=1 -n {namespace}
# Check scaling status
kubectl get deployment hello-engine -n {namespace}
Update Configuration
# Edit deployment
kubectl edit deployment hello-engine -n {namespace}
# Or update via Helm
helm upgrade hello-engine ./charts/hello-engine \
  --set config.featureX=true \
  --namespace {namespace}
# Restart to apply changes
kubectl rollout restart deployment/hello-engine -n {namespace}
Monitoring Configuration
Uptime Kuma
Monitors to Configure:
- Health Check Monitor
  - Type: HTTP(s)
  - Name: Hello Engine - {namespace}
  - URL: http://hello-engine.{namespace}.svc.cluster.local/.well-known/engine-status
  - Interval: 60 seconds
  - Alert: Slack #alerts-warning after 2 failures
- Response Time Monitor
  - Type: HTTP(s)
  - Name: Hello Engine - Response Time
  - URL: Same as above
  - Alert: Slack #alerts-warning if >1000ms
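The 1000 ms threshold on the Response Time Monitor can also be spot-checked from a shell. This is a rough sketch, assuming `curl` and `awk` are available where it runs. `response_time_ms` and `exceeds_latency_threshold` are hypothetical helper names, and a single request is only a sample, not a reproduction of what Uptime Kuma measures.

```shell
# Hypothetical helper: print the total request time for a URL in whole milliseconds.
response_time_ms() {
  # curl reports time_total in seconds (for example 0.25); round to the nearest millisecond.
  curl -s -o /dev/null -w '%{time_total}' --max-time 10 "$1" |
    awk '{ printf "%d\n", $1 * 1000 + 0.5 }'
}

# Hypothetical helper: succeeds if the measured value is above the threshold.
# $1: measured milliseconds, $2: threshold in milliseconds (defaults to 1000)
exceeds_latency_threshold() {
  [ "$1" -gt "${2:-1000}" ]
}
```

Example use from inside the cluster: `ms="$(response_time_ms http://hello-engine.{namespace}.svc.cluster.local/.well-known/engine-status)"`, followed by `exceeds_latency_threshold "$ms" && echo "slow: ${ms}ms"`.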
PostHog Analytics
Create Dashboard: "Hello Engine Analytics"
Track Events:
- Health checks performed
- API requests by endpoint
- Response times
- Error rates
- Client demo sessions
Useful Insights:
// Track health check
posthog.capture('hello_engine_health_check', {
  status: 'ok',
  response_time_ms: 45,
  namespace: 'client-demo'
})
// Track demo session
posthog.capture('hello_engine_demo_started', {
  client: 'acme-corp',
  user: 'demo-user'
})
Slack Alerts
Configure alert channels:
- Service down → #alerts-critical
- High latency → #alerts-warning
- Demo session started → #sales-demo
- Error rate spike → #alerts-warning
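If alerts are sent from scripts rather than configured by hand, the routing above can live in one lookup function so the mapping is defined in a single place. A minimal sketch: the channel names come from the list above, while the event keys (`service_down`, `high_latency`, and so on) and the function name `alert_channel` are illustrative assumptions.

```shell
# Hypothetical routing helper: map an event key to the Slack channel listed above.
alert_channel() {
  case "$1" in
    service_down)     echo "#alerts-critical" ;;
    high_latency)     echo "#alerts-warning" ;;
    demo_started)     echo "#sales-demo" ;;
    error_rate_spike) echo "#alerts-warning" ;;
    *)
      # Unknown events fail loudly instead of being routed silently.
      echo "unknown event: $1" >&2
      return 1
      ;;
  esac
}
```

Example use: `channel="$(alert_channel service_down)"`.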
Autoscaling Configuration
Horizontal Pod Autoscaler
# Save as: k8s/hello-engine-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-engine
  namespace: {namespace}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-engine
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Apply:
kubectl apply -f k8s/hello-engine-hpa.yaml
Autoscaling Thresholds:
- Min replicas: 1
- Max replicas: 5
- CPU target: 70%
- Memory target: 80%
- Scale up: When threshold exceeded for 2 minutes
- Scale down: When below threshold for 5 minutes
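To reason about how these targets translate into replica counts, the Kubernetes documentation gives the core HPA calculation as desired = ceil(currentReplicas × currentMetric / targetMetric), bounded by the min and max replicas. The sketch below applies that formula to a single metric using the 1 to 5 replica range from this section. `hpa_desired_replicas` is a hypothetical helper for back-of-the-envelope checks, not a reimplementation of the controller.

```shell
# Hypothetical helper: estimate the replica count the HPA formula would ask for.
# $1: current replicas, $2: current average utilization (%), $3: target utilization (%)
# $4: min replicas (default 1), $5: max replicas (default 5)
hpa_desired_replicas() {
  current="$1"; utilization="$2"; target="$3"
  min="${4:-1}"; max="${5:-5}"
  # Integer ceiling division: ceil(a / b) == (a + b - 1) / b for positive integers
  desired=$(( ($current * $utilization + $target - 1) / $target ))
  # Clamp to the configured replica range
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}
```

For example, `hpa_desired_replicas 2 105 70` prints `3`. When several metrics are configured, as with CPU and memory here, the controller uses the largest of the per-metric results. The sketch ignores the controller's tolerance band and stabilization windows, so it does not model the timing rules listed above.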
Deployment Procedures
Deploy New Version
# 1. Build and push image
docker build -t ghcr.io/egintegrations/hello-engine:v1.2.0 .
docker push ghcr.io/egintegrations/hello-engine:v1.2.0
# 2. Update Helm values
helm upgrade hello-engine ./charts/hello-engine \
  --set image.tag=v1.2.0 \
  --namespace {namespace} \
  --wait
# 3. Verify rollout
kubectl rollout status deployment/hello-engine -n {namespace}
# 4. Test health endpoint
# (port-forward stays in the foreground; run it in a separate terminal, then curl from another)
kubectl port-forward -n {namespace} svc/hello-engine 8080:80
curl http://localhost:8080/.well-known/engine-status
# 5. Post in Slack #alerts-info
# "Hello Engine updated to v1.2.0 in {namespace}"
Rollback Deployment
# View rollout history
kubectl rollout history deployment/hello-engine -n {namespace}
# Rollback to previous version
kubectl rollout undo deployment/hello-engine -n {namespace}
# Or rollback to specific revision
kubectl rollout undo deployment/hello-engine --to-revision=3 -n {namespace}
# Verify rollback
kubectl rollout status deployment/hello-engine -n {namespace}
# Post in Slack #alerts-critical
# "Hello Engine rolled back in {namespace} - reason: [description]"
Troubleshooting
Service Not Responding
Investigation:
# Check pod status
kubectl get pods -n {namespace} -l app=hello-engine
# Check events
kubectl get events -n {namespace} --sort-by=.lastTimestamp
# Check logs for errors
kubectl logs -n {namespace} deployment/hello-engine --tail=100 | grep ERROR
# Check resource usage
kubectl top pod -n {namespace} -l app=hello-engine
Common Issues:
- Pod stuck in CrashLoopBackOff → See runbook: 01-engine-down.md
- High memory usage → Increase resource limits
- Network issues → Check service and ingress configuration
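When several replicas are running, it can help to list only the pods that are not healthy before going through the cases above. A minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, ...). `unhealthy_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: print "name status" for hello-engine pods that are
# neither Running nor Completed. $1: namespace
unhealthy_pods() {
  # --no-headers drops the column header so awk only sees pod rows;
  # STATUS is the third column in the default output.
  kubectl get pods -n "$1" -l app=hello-engine --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

An empty result means every matching pod reports Running or Completed. Anything printed, such as a pod in CrashLoopBackOff, points to the matching entry under Common Issues.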
High Latency
Investigation:
# Check response times via PostHog dashboard
# Check pod resources
kubectl top pod -n {namespace} -l app=hello-engine
# Check CPU limits (kubectl describe shows configured requests and limits, not throttling counters)
kubectl describe pod -n {namespace} {pod-name} | grep -A 5 -i limits
# Check network latency (requires ping in the container image; -c 4 stops after four packets)
kubectl exec -it -n {namespace} deployment/hello-engine -- ping -c 4 8.8.8.8
Fixes:
- Increase CPU limits
- Scale up replicas
- Enable autoscaling
- Check database connection pool
Feature Toggle Not Working
Investigation:
# Check environment variables
kubectl get deployment hello-engine -n {namespace} -o yaml | grep -A 10 env:
# Check config map
kubectl get configmap hello-engine-config -n {namespace} -o yaml
# Test feature toggle endpoint (requires an active port-forward to svc/hello-engine on port 8080)
curl http://localhost:8080/api/features
Fixes:
- Update ConfigMap
- Restart deployment
- Verify feature flag in code
Feature Toggle Mapping
Feature toggles for hello-engine:
| Feature | Environment Variable | Default | Description |
|---|---|---|---|
| Demo Mode | DEMO_MODE_ENABLED | true | Enable demo responses |
| Analytics | ANALYTICS_ENABLED | true | Send events to PostHog |
| Debug Logging | DEBUG_LOGGING | false | Verbose logs |
| Health Check | HEALTH_CHECK_ENABLED | true | Enable health endpoint |
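The defaults in this table can be encoded in a small lookup so scripts can tell what value a toggle would take when its variable is unset. This is a sketch only: how the service itself parses these variables is not documented here, and `toggle_default` and `toggle_value` are hypothetical helper names.

```shell
# Hypothetical helper: print the documented default for a toggle variable.
toggle_default() {
  case "$1" in
    DEMO_MODE_ENABLED|ANALYTICS_ENABLED|HEALTH_CHECK_ENABLED) echo true ;;
    DEBUG_LOGGING) echo false ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the effective value, meaning the variable's
# current value if it is set and non-empty, otherwise the documented default.
toggle_value() {
  default="$(toggle_default "$1")" || { echo "unknown toggle: $1" >&2; return 1; }
  # The name was validated by toggle_default, so eval only sees a known identifier.
  eval "value=\${$1:-\$default}"
  echo "$value"
}
```

Example use: `toggle_value DEBUG_LOGGING` prints `false` unless `DEBUG_LOGGING` is set to something else in the current shell.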
Update Feature Toggles
# Via Helm
helm upgrade hello-engine ./charts/hello-engine \
  --set env.DEMO_MODE_ENABLED=false \
  --namespace {namespace}
# Or edit ConfigMap
kubectl edit configmap hello-engine-config -n {namespace}
# Restart to apply
kubectl rollout restart deployment/hello-engine -n {namespace}
Maintenance Checklist
Daily:
- Check Uptime Kuma status
- Review error logs for anomalies
- Verify PostHog receiving events
Weekly:
- Review response time trends in PostHog
- Check resource utilization
- Test demo endpoints
- Review feature toggle usage
Monthly:
- Update hello-engine to latest version
- Audit feature toggles
- Review autoscaling configuration
- Test disaster recovery procedure
- Update documentation
Security Monitoring with CrowdSec
Configure CrowdSec Rules
# CrowdSec scenario: Hello Engine abuse
type: leaky
name: egi/hello-engine-abuse
description: "Detect excessive hello-engine requests"
filter: "evt.Meta.service == 'hello-engine'"
leakspeed: "10s"
capacity: 100
labels:
  service: hello-engine
  remediation: ban
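To get a feel for what `capacity: 100` and `leakspeed: "10s"` allow, the scenario can be approximated with a simplified leaky-bucket model: each event adds one unit, one unit drains every leak interval, and the bucket overflows once its level exceeds the capacity. The sketch below is an illustration of that model only, not CrowdSec's implementation, and `bucket_overflow_time` is a hypothetical helper name.

```shell
# Hypothetical helper: simulate a simplified leaky bucket.
# $1: capacity, $2: leak interval in seconds
# stdin: one event timestamp (seconds, ascending) per line
# Prints the timestamp of the first overflow; exits 1 if the bucket never overflows.
bucket_overflow_time() {
  awk -v cap="$1" -v leak="$2" '
    # Drain the bucket for the time elapsed since the previous event.
    NR > 1 { level -= ($1 - last) / leak; if (level < 0) level = 0 }
    # Pour the current event into the bucket.
    { level += 1; last = $1 }
    # Overflow once the level exceeds the capacity.
    level > cap { print $1; found = 1; exit }
    END { exit (found ? 0 : 1) }
  '
}
```

Under this model, a burst of 101 requests within the same second overflows a bucket of capacity 100, while a steady stream slower than one request per 10 seconds never does.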
Monitor For:
- Excessive health check requests
- API endpoint abuse
- Failed authentication attempts
- Suspicious traffic patterns
Actions:
- Auto-ban IPs exceeding rate limits
- Alert in Slack #security-alerts
- Log security events for review
Production Readiness Checklist
Before promoting hello-engine to production:
- Autoscaling configured and tested
- Resource limits properly set
- Health checks configured in Uptime Kuma
- PostHog analytics tracking all events
- Slack alerts configured for all severity levels
- CrowdSec security rules active
- Backup and rollback procedures tested
- Documentation complete and reviewed
- On-call escalation path defined
- Client-specific configurations documented
Future Documentation Tasks
The following tasks are still outstanding for this runbook:
- Document expected replicas for each environment
- Capture detailed restart/rollback procedures
- Record all feature toggle mappings and fallbacks
- Add client-specific customizations
- Document integration points
- Add performance tuning guidelines
- Create troubleshooting decision tree
- Add load testing procedures
Last Updated: 2026-03-25
Version: 1.0
Status: Draft - Update before production deployment