Runbook: Deploy New Engine for Client

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Complexity: MEDIUM
Time Required: 30-60 minutes
Owner: Platform Team

Overview

This runbook walks through deploying a new engine for a client using one of the EGI templates.

Prerequisites

Client added to Control Center
Kubernetes namespace created
GitHub repository created (if using CI/CD)
GHCR access configured
Helm installed locally
Uptime Kuma monitor prepared for new engine
Slack notification channel ready

Step 1: Choose Template (2 minutes)

Available Templates:

api-fastapi - REST API with database
bot-worker - Background task processing
rag-engine - LLM-powered Q&A system

Selection Criteria:

Need API endpoints? → api-fastapi
Need background jobs? → bot-worker
Need document search/chat? → rag-engine

Step 2: Copy Template (5 minutes)

# Navigate to templates
cd /Users/elliottgodwin/Desktop/egi-engine/templates

# Copy chosen template
cp -r api-fastapi ~/client-projects/{client-name}-{module}

# Navigate to new directory
cd ~/client-projects/{client-name}-{module}

# Example:
# cp -r api-fastapi ~/client-projects/acme-corp-api
# cd ~/client-projects/acme-corp-api
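
The copy step can also be scripted so that the naming convention is checked before anything is written. The following is a minimal sketch, not part of the EGI tooling: the function name `copy_template` and the kebab-case rule are assumptions made for illustration, and the directory layout simply mirrors the paths used above.

```python
import re
import shutil
from pathlib import Path

# Assumption: client and module names are lowercase kebab-case, as in "acme-corp".
NAME_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def copy_template(templates_dir: Path, template: str, projects_dir: Path,
                  client_name: str, module: str) -> Path:
    """Copy a template directory to {projects_dir}/{client_name}-{module}."""
    for label, value in (("client_name", client_name), ("module", module)):
        if not NAME_PATTERN.fullmatch(value):
            raise ValueError(f"{label} must be lowercase kebab-case, got {value!r}")

    source = templates_dir / template
    if not source.is_dir():
        raise FileNotFoundError(f"template not found: {source}")

    target = projects_dir / f"{client_name}-{module}"
    # copytree raises FileExistsError if the target already exists, which
    # protects an already-initialized client project from being overwritten.
    shutil.copytree(source, target)
    return target
```

Calling `copy_template(Path("templates"), "api-fastapi", Path.home() / "client-projects", "acme-corp", "api")` would produce the same directory as the `cp -r` example.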

Step 3: Configure Engine (10 minutes)

A. Update Engine Identification

Edit: app/main.py (or worker/main.py for the bot-worker template)

# Change these lines:
ENGINE_ID = os.getenv("ENGINE_ID", "{client-name}.{module}")
ENGINE_SKU = os.getenv("ENGINE_SKU", "{X.Y.Z}.{TYPE}.{PLATFORM}.{YYYYMMDD}")

# Example:
ENGINE_ID = os.getenv("ENGINE_ID", "acme-corp.api")
ENGINE_SKU = os.getenv("ENGINE_SKU", "1.0.0.API.EKS.20260325")

Generate SKU:

cd /Users/elliottgodwin/Desktop/egi-engine
python standards/sku/gen_sku.py -M 1 -m 0 -p 0 -t API -P EKS
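
To make the SKU layout `{X.Y.Z}.{TYPE}.{PLATFORM}.{YYYYMMDD}` concrete, here is a small sketch that composes and parses a string in that format. It is an illustration only and is not the implementation of `standards/sku/gen_sku.py`; the function names `build_sku` and `parse_sku` are hypothetical, and upper-casing the type and platform is an assumption based on the example value above.

```python
from datetime import date, datetime
from typing import Optional


def build_sku(major: int, minor: int, patch: int,
              engine_type: str, platform: str,
              build_date: Optional[date] = None) -> str:
    """Compose a SKU string; the date defaults to today."""
    build_date = build_date or date.today()
    version = f"{major}.{minor}.{patch}"
    # Assumption: TYPE and PLATFORM are upper-case, as in "API" and "EKS".
    return f"{version}.{engine_type.upper()}.{platform.upper()}.{build_date:%Y%m%d}"


def parse_sku(sku: str) -> dict:
    """Split a SKU back into its six dot-separated fields."""
    parts = sku.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}: {sku!r}")
    major, minor, patch, engine_type, platform, stamp = parts
    return {
        "version": (int(major), int(minor), int(patch)),
        "type": engine_type,
        "platform": platform,
        "date": datetime.strptime(stamp, "%Y%m%d").date(),
    }
```

A round trip through `parse_sku` is a cheap way to catch a malformed SKU before it is written into `values.yaml` or the engines inventory.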

B. Update Helm Values

Edit: chart/values.yaml

# Update image repository
image:
  repository: ghcr.io/egintegrations/{client-name}-{module}
  tag: "main"

# Update environment variables
env:
  ENGINE_ID: "{client-name}.{module}"
  ENGINE_SKU: "{generated_sku}"
  DATABASE_URL: "postgresql://..." # If using database

C. Update README

Edit: README.md

Replace "my-new-api" with actual name
Update deployment instructions
Add client-specific notes

Step 4: Initialize Git Repository (5 minutes)

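# Initialize git with "main" as the initial branch (requires Git 2.28+),
# so that the later "git push -u origin main" finds the branch
git init -b main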

# Add all files
git add .

# Initial commit
git commit -m "feat: initialize {client-name} {module} from EGI template

Based on: templates/{template-name}
Client: {client-name}
Module: {module}"

# Create GitHub repository
gh repo create {client-name}-{module} --private --source=. --remote=origin

# Push to GitHub
git push -u origin main

Step 5: Configure GitHub Actions (3 minutes)

Verify workflow file exists:

ls .github/workflows/build.yaml

Add GitHub Secret for GHCR:

Go to: https://github.com/{org}/{repo}/settings/secrets/actions
Click "New repository secret"
Name: GHCR_PAT
Value: Your GitHub Personal Access Token
Click "Add secret"

Create PAT if needed:

Go to: https://github.com/settings/tokens
Generate new token (classic)
Scopes: write:packages, read:packages
Copy token immediately

Step 6: Build and Push Image (10 minutes)

Option A: Manual Build (Faster for First Deploy)

# Login to GHCR (GHCR_PAT must be set in your shell)
echo "$GHCR_PAT" | docker login ghcr.io -u {github_username} --password-stdin

# Build image
docker build -t ghcr.io/egintegrations/{client-name}-{module}:main .

# Push image
docker push ghcr.io/egintegrations/{client-name}-{module}:main

# Verify image exists
docker pull ghcr.io/egintegrations/{client-name}-{module}:main

Option B: GitHub Actions (Automated)

# Push triggers automatic build
git push origin main

# Monitor build
# Go to: https://github.com/{org}/{repo}/actions

# Wait for green checkmark (2-5 minutes)

Step 7: Create Kubernetes Namespace (1 minute)

# Create the namespace (this errors with AlreadyExists if it is already present, which is safe to ignore)
kubectl create namespace client-{client-slug}

# Example:
kubectl create namespace client-acme-corp

# Verify
kubectl get namespace client-{client-slug}
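
Kubernetes requires namespace names to be valid DNS-1123 labels: at most 63 characters, lowercase alphanumerics or `-`, starting and ending with an alphanumeric. A quick check of the client slug before running `kubectl` can be sketched as follows; the helper name `client_namespace` is hypothetical.

```python
import re

# DNS-1123 label: lowercase alphanumerics or '-', alphanumeric at both ends.
DNS_1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def client_namespace(client_slug: str) -> str:
    """Return "client-{client_slug}" if it is a valid namespace name."""
    namespace = f"client-{client_slug}"
    if len(namespace) > MAX_LABEL_LENGTH:
        raise ValueError(f"namespace longer than {MAX_LABEL_LENGTH} characters: {namespace!r}")
    if not DNS_1123_LABEL.fullmatch(namespace):
        raise ValueError(f"not a valid DNS-1123 label: {namespace!r}")
    return namespace
```

For example, `client_namespace("acme-corp")` returns `"client-acme-corp"`, while a slug such as `"Acme_Corp"` is rejected before it reaches the cluster.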

Step 8: Deploy with Helm (5 minutes)

# Deploy
helm upgrade --install {client-name}-{module} ./chart \
  --set image.repository=ghcr.io/egintegrations/{client-name}-{module} \
  --set image.tag=main \
  --namespace client-{client-slug} \
  --create-namespace \
  --wait

# Example:
helm upgrade --install acme-corp-api ./chart \
  --set image.repository=ghcr.io/egintegrations/acme-corp-api \
  --set image.tag=main \
  --namespace client-acme-corp \
  --create-namespace \
  --wait

Expected output:

Release "acme-corp-api" has been installed. Happy Helming!
NAME: acme-corp-api
LAST DEPLOYED: ...
NAMESPACE: client-acme-corp
STATUS: deployed
REVISION: 1
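
If the deploy is driven from a script, the Helm invocation above can be assembled from the same three placeholders. This is a sketch under the assumption that the image registry prefix stays `ghcr.io/egintegrations`; the function name `helm_deploy_command` is hypothetical.

```python
import shlex
from typing import List


def helm_deploy_command(client_name: str, module: str, client_slug: str,
                        tag: str = "main",
                        registry: str = "ghcr.io/egintegrations") -> List[str]:
    """Build the argv list for the `helm upgrade --install` call shown above."""
    release = f"{client_name}-{module}"
    return [
        "helm", "upgrade", "--install", release, "./chart",
        "--set", f"image.repository={registry}/{release}",
        "--set", f"image.tag={tag}",
        "--namespace", f"client-{client_slug}",
        "--create-namespace",
        "--wait",
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(helm_deploy_command("acme-corp", "api", "acme-corp")))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues; printing it first, as above, keeps a human in the loop.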

Step 9: Verify Deployment (5 minutes)

A. Check Pods

# Check pod status
kubectl get pods -n client-{client-slug}

# Should show:
# NAME                     READY   STATUS    RESTARTS   AGE
# acme-corp-api-xxx        1/1     Running   0          2m

# Check logs
kubectl logs -f -n client-{client-slug} deployment/{client-name}-{module}

B. Test Health Endpoint

# Port forward
kubectl port-forward -n client-{client-slug} svc/{client-name}-{module} 8080:80

# Test (in another terminal)
curl http://localhost:8080/.well-known/engine-status

# Should return JSON with:
# - id: "{client-name}.{module}"
# - sku: "{sku}"
# - status.state: "ok"
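
The three checks on the status response can be automated. The sketch below assumes the response is a JSON object with top-level `id` and `sku` fields and a nested `status` object containing `state`, which is how the list above reads; any other fields are ignored. Both function names are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen


def check_engine_status(payload: dict, expected_id: str, expected_sku: str) -> List[str]:
    """Return a list of problems; an empty list means the response looks healthy."""
    problems = []
    if payload.get("id") != expected_id:
        problems.append(f"id is {payload.get('id')!r}, expected {expected_id!r}")
    if payload.get("sku") != expected_sku:
        problems.append(f"sku is {payload.get('sku')!r}, expected {expected_sku!r}")
    status = payload.get("status")
    state = status.get("state") if isinstance(status, dict) else None
    if state != "ok":
        problems.append(f"status.state is {state!r}, expected 'ok'")
    return problems


def fetch_engine_status(url: str = "http://localhost:8080/.well-known/engine-status",
                        timeout: float = 5.0) -> dict:
    """Fetch the status document through the port-forward opened above."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the port-forward running, `check_engine_status(fetch_engine_status(), "acme-corp.api", "1.0.0.API.EKS.20260325")` should return an empty list.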

C. Check in Control Center

# Trigger refresh
curl -X POST https://control-center.egintegrations.com/api/refresh

# Wait 2-3 minutes, then check UI
# Open: https://control-center.egintegrations.com/ui
# Engines tab should show new engine

Step 10: Register Engine (5 minutes)

Add to engines inventory:

# Edit engines.yaml
cd /Users/elliottgodwin/Desktop/egi-engine
nano engines/engines.yaml

Add entry:

- id: {client-name}.{module}
  client: {client-name}
  environment: production
  module: {module}
  type: {TYPE}  # API, BOT, RAG, etc.
  platform: EKS
  sku: {generated_sku}
  runtime: kubernetes
  namespace: client-{client-slug}
  base_url: http://{client-name}-{module}.client-{client-slug}.svc.cluster.local
  status_path: /.well-known/engine-status
  poll_interval_seconds: 900
  links:
    dashboard: https://grafana.egintegrations.com
    logs: https://grafana.egintegrations.com/explore
  notes: "Deployed on {date}"
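
Most fields in the entry are derived from the same three inputs (client name, module, client slug), so they can be generated rather than typed by hand. The sketch below builds the entry as a plain dictionary; the function name `engine_entry` is hypothetical, and serializing the result to YAML (for example with a YAML library) is left out.

```python
from datetime import date


def engine_entry(client_name: str, module: str, client_slug: str,
                 engine_type: str, sku: str, deployed_on: date,
                 platform: str = "EKS") -> dict:
    """Build one inventory entry with the fields shown above."""
    namespace = f"client-{client_slug}"
    service = f"{client_name}-{module}"
    return {
        "id": f"{client_name}.{module}",
        "client": client_name,
        "environment": "production",
        "module": module,
        "type": engine_type,
        "platform": platform,
        "sku": sku,
        "runtime": "kubernetes",
        "namespace": namespace,
        # In-cluster DNS name of the Service: <service>.<namespace>.svc.cluster.local
        "base_url": f"http://{service}.{namespace}.svc.cluster.local",
        "status_path": "/.well-known/engine-status",
        "poll_interval_seconds": 900,
        "links": {
            "dashboard": "https://grafana.egintegrations.com",
            "logs": "https://grafana.egintegrations.com/explore",
        },
        "notes": f"Deployed on {deployed_on.isoformat()}",
    }
```

Generating `base_url` from the service and namespace names avoids the most common typo in hand-written entries, a mismatch between the namespace field and the hostname.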

Commit and push:

git add engines/engines.yaml
git commit -m "feat: register {client-name} {module} engine"
git push

Step 11: Configure Monitoring (5 minutes)

A. Uptime Kuma

Log into Uptime Kuma dashboard
Click "Add New Monitor"
Configure:
- Monitor Type: HTTP(s)
- Friendly Name: {client-name}.{module}
- URL: the engine's health check endpoint (its reachable base URL followed by /.well-known/engine-status)
- Heartbeat Interval: 60 seconds
- Retries: 3
- Notifications: Slack #alerts-warning
Save monitor

B. Slack Notifications

Configure alert routing in Slack:

Critical alerts → #alerts-critical
Warning alerts → #alerts-warning
Info alerts → #alerts-info

C. PostHog Analytics (Optional)

If engine exposes user analytics:

Add PostHog API key to environment variables
Configure event tracking
Create custom dashboard in PostHog

D. CrowdSec Security (Optional)

If engine is public-facing:

Configure CrowdSec agent for namespace
Set up IP ban rules
Monitor security events in CrowdSec dashboard

Step 12: Verify Everything (5 minutes)

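Checklist:

Pods in client-{client-slug} are Running and READY 1/1 (Step 9A)
Health endpoint returns the expected id and sku with status.state "ok" (Step 9B)
Engine appears in the Control Center Engines tab (Step 9C)
Engine is registered in engines/engines.yaml (Step 10)
Uptime Kuma monitor is up and Slack alerts are routed (Step 11)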

Post-Deployment

1. Document

Create deployment record:

# In client project folder
cat > DEPLOYMENT.md <<EOF
# Deployment Record

**Date:** $(date)
**Engine:** {client-name}.{module}
**SKU:** {sku}
**Namespace:** client-{client-slug}
**Image:** ghcr.io/egintegrations/{client-name}-{module}:main

## Deployed By
- Name: {your_name}
- Date: $(date)

## Configuration
- Database: {if applicable}
- External Services: {if applicable}
- Environment Variables: {list important ones}

## Monitoring
- Uptime Kuma: ✅
- Slack Alerts: ✅ (#alerts-warning, #alerts-critical)
- PostHog Analytics: {✅ or N/A}
- CrowdSec Security: {✅ or N/A}

## Verification
- Health Check: ✅
- Control Center: ✅
- Monitoring: ✅

## Notes
{any special notes}
EOF

2. Notify Team

Post in Slack with:

Engine name and ID
Deployment status
Monitoring links (Uptime Kuma, PostHog)
Support contact
Next steps

3. Set Up Alerts

Verify alerts are configured:

Uptime Kuma: Pod down, health check failure
PostHog: Usage analytics, error tracking
CrowdSec: Security threats, IP bans
Resource alerts: High CPU/memory via Kubernetes metrics

Rollback Procedure

If deployment fails:

# Delete deployment
helm uninstall {client-name}-{module} -n client-{client-slug}

# Or rollback to previous version
helm rollback {client-name}-{module} -n client-{client-slug}

# Check status
helm history {client-name}-{module} -n client-{client-slug}

Post rollback notification in Slack #alerts-critical.

Troubleshooting

Pod Not Starting

See: runbooks/01-engine-down.md

Image Pull Failed

# Check image exists
docker pull ghcr.io/egintegrations/{client-name}-{module}:main

# Check image pull secrets
kubectl get secrets -n client-{client-slug}

# Create one if missing, using the GHCR credentials from Steps 5 and 6
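
For reference, an image pull secret is a Secret of type `kubernetes.io/dockerconfigjson` whose `.dockerconfigjson` key holds a base64-encoded Docker config. The sketch below builds such a manifest; the function name `ghcr_pull_secret` and the secret name used in the usage note are hypothetical, and whether the chart exposes a value for `imagePullSecrets` is not stated in this runbook, so check the chart before relying on it.

```python
import base64
import json


def ghcr_pull_secret(name: str, namespace: str, username: str, token: str,
                     server: str = "ghcr.io") -> dict:
    """Build a kubernetes.io/dockerconfigjson Secret manifest for a registry."""
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    docker_config = {
        "auths": {server: {"username": username, "password": token, "auth": auth}}
    }
    encoded = base64.b64encode(json.dumps(docker_config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }
```

The manifest can be written out with `json.dumps(...)` and piped to `kubectl apply -f -`, since kubectl accepts JSON as well as YAML. The command `kubectl create secret docker-registry` produces the same kind of Secret directly. Treat the token as sensitive and avoid committing the generated manifest.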

Health Check Failing

# Check logs
kubectl logs -n client-{client-slug} deployment/{client-name}-{module}

# Common issues:
# - Database not accessible
# - Missing environment variables
# - Port mismatch

Next Steps

After successful deployment:

Set up CI/CD for automatic deployments
Configure ingress (if public-facing)
Add custom domain
Set up automated backups
Create client-specific dashboard in PostHog
Review CrowdSec security logs weekly

Last Updated: 2026-03-25
Version: 1.1
