
Bot Operator Operations Runbook

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Status: In Development
Owner: Platform Team

This runbook covers operational procedures for bot operator services that handle background task processing and event queue management.


  • Service manifest: engines/client-sample.bot-operator.yaml
  • ApplicationSet: k8s/argocd/applicationset-bot-operator.yaml
  • Source: apps/bot-operator/
  • Uptime Kuma: Bot operator service monitors
  • Slack Alerts: #alerts-warning, #alerts-critical

Common Operations

Check Bot Operator Status

# Check pod status
kubectl get pods -n {namespace} -l app=bot-operator

# Check logs
kubectl logs -n {namespace} deployment/bot-operator --tail=100

# Check health endpoint (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n {namespace} svc/bot-operator 8080:80
curl http://localhost:8080/.well-known/engine-status
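
The health check above can be scripted so the port-forward is opened and closed automatically. This is a minimal sketch, not the team's tooling: the helper names `is_healthy_status` and `check_bot_operator` are hypothetical, and it assumes the status endpoint answers with a 2xx code when healthy, which the runbook does not state.

```shell
# Returns success (0) when an HTTP status code is in the 2xx range.
# Assumption: the engine-status endpoint signals health with a 2xx response.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Port-forwards in the background, probes the endpoint once, then cleans up.
# Usage: check_bot_operator <namespace>
check_bot_operator() {
  local ns="$1" pf_pid code
  kubectl port-forward -n "$ns" svc/bot-operator 8080:80 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2  # give the tunnel a moment to open
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    http://localhost:8080/.well-known/engine-status)
  kill "$pf_pid" 2>/dev/null
  is_healthy_status "$code"
}
```

Calling `check_bot_operator my-namespace && echo healthy` would report the result; curl prints `000` when it cannot connect, which the helper treats as unhealthy.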

Monitor Queue Backlog

Event Queue Integrations:

  • AWS SQS
  • Google Pub/Sub
  • Redis Queue
  • RabbitMQ

# Check queue depth (example for SQS)
aws sqs get-queue-attributes \
  --queue-url {queue_url} \
  --attribute-names ApproximateNumberOfMessages

# Check processing rate via PostHog
# Dashboard: Bot Operator → Queue Processing Metrics
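
When the PostHog dashboard is unavailable, two queue-depth samples taken a known interval apart give a rough picture of whether the backlog is growing. The sketch below assumes SQS and uses hypothetical helper names; it measures net backlog change, not the processing rate itself, because new arrivals are included in the difference.

```shell
# Prints the approximate number of visible messages in an SQS queue.
# Usage: sqs_depth <queue_url>
sqs_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# Net backlog change in messages per minute between two samples.
# Positive means messages arrive faster than they are processed.
# Usage: backlog_growth_per_min <previous_depth> <current_depth> <seconds_between>
backlog_growth_per_min() {
  local prev="$1" curr="$2" secs="$3"
  echo $(( (curr - prev) * 60 / secs ))
}
```

For example, sample `sqs_depth` twice with `sleep 60` in between and pass both values plus `60` to `backlog_growth_per_min`.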

Alert Thresholds

Configure Uptime Kuma monitors:

  1. Queue Backlog Alert (Warning)

    • Threshold: >1000 messages
    • Duration: >10 minutes
    • Action: Post to Slack #alerts-warning
  2. Queue Backlog Critical (Critical)

    • Threshold: >10,000 messages
    • Duration: >5 minutes
    • Action: Post to Slack #alerts-critical
  3. Processing Error Rate (Warning)

    • Threshold: >5% error rate
    • Duration: >5 minutes
    • Action: Post to Slack #alerts-warning
  4. Bot Operator Down (Critical)

    • Threshold: Health check fails
    • Duration: >2 minutes
    • Action: Post to Slack #alerts-critical
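
The backlog and error-rate thresholds above can be expressed as small shell checks, which is handy for ad hoc verification or a scripted monitor. This is an illustrative sketch with hypothetical function names; it covers only the threshold comparison and leaves the duration condition to the monitoring tool.

```shell
# Maps a queue depth to a severity using the thresholds listed above:
# more than 10,000 messages is critical, more than 1000 is a warning.
classify_backlog() {
  local depth="$1"
  if [ "$depth" -gt 10000 ]; then
    echo critical
  elif [ "$depth" -gt 1000 ]; then
    echo warning
  else
    echo ok
  fi
}

# Succeeds when errors make up more than 5% of processed messages.
# Uses integer math (errors * 100 > total * 5) to avoid floating point.
# Usage: error_rate_exceeded <error_count> <total_count>
error_rate_exceeded() {
  local errors="$1" total="$2"
  [ "$total" -gt 0 ] && [ $((errors * 100)) -gt $((total * 5)) ]
}
```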

Troubleshooting

Queue Backlog Growing

Symptoms:

  • Messages piling up in queue
  • Slow processing rate
  • PostHog showing decreased throughput

Investigation:

# Check CPU/memory usage
kubectl top pod -n {namespace} -l app=bot-operator

# Check logs for errors
kubectl logs -n {namespace} deployment/bot-operator --tail=200

# Check worker count
kubectl get deployment bot-operator -n {namespace} -o yaml | grep replicas

Fixes:

  1. Scale Up Workers

    kubectl scale deployment bot-operator --replicas=5 -n {namespace}
  2. Increase Resources

    # Edit Helm values
    resources:
      limits:
        cpu: 2000m
        memory: 4Gi
  3. Purge Failed Messages

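    # Caution: purge-queue permanently deletes every message in the queue; it does not move them.
    # Inspect or redrive the dead letter queue first, then purge what remains.
    aws sqs purge-queue --queue-url {dlq_url}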
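
Before scaling up, it can help to estimate how many workers are needed instead of picking a fixed number. The sketch below is a back-of-the-envelope calculation under stated assumptions: the function name is hypothetical, per-worker throughput must be measured (for example from the PostHog dashboard), and new arrivals during the drain window are ignored.

```shell
# Estimates the replicas needed to drain a backlog within a target window.
# replicas = ceil(backlog / (per_worker_rate_per_min * target_minutes)), minimum 1.
# Usage: replicas_to_drain <backlog> <msgs_per_worker_per_min> <target_minutes>
replicas_to_drain() {
  local backlog="$1" rate="$2" minutes="$3" capacity replicas
  capacity=$((rate * minutes))                         # messages one worker clears in the window
  replicas=$(( (backlog + capacity - 1) / capacity ))  # integer ceiling division
  if [ "$replicas" -lt 1 ]; then
    replicas=1
  fi
  echo "$replicas"
}
```

With 12,000 queued messages, 50 messages per worker per minute, and a 60-minute target, `replicas_to_drain 12000 50 60` suggests 4 replicas, which can then be passed to `kubectl scale`.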

High Error Rate

Investigation:

# Check error logs
kubectl logs -n {namespace} deployment/bot-operator | grep ERROR

# Check PostHog error tracking
# Dashboard: Bot Operator → Error Analysis

Common Causes:

  • External API timeouts
  • Database connection issues
  • Invalid message format
  • Rate limiting

Fixes:

  1. Retry failed messages
  2. Increase API timeout settings
  3. Check database connectivity
  4. Implement exponential backoff
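
Exponential backoff, listed in the fixes above, means doubling the wait after each failed attempt up to a ceiling. The following is a generic shell sketch of the technique, not the bot operator's own retry logic; the function names and parameters are hypothetical.

```shell
# Delay in seconds before the next retry: base * 2^(attempt - 1), capped at <cap>.
# Usage: backoff_delay <attempt (1-based)> <base_seconds> <cap_seconds>
backoff_delay() {
  local attempt="$1" base="$2" cap="$3" delay
  delay=$(( base * (1 << (attempt - 1)) ))
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# Runs a command until it succeeds or <max_attempts> is reached,
# sleeping with exponential backoff between attempts.
# Usage: retry_with_backoff <max_attempts> <base_seconds> <cap_seconds> <command...>
retry_with_backoff() {
  local max="$1" base="$2" cap="$3" attempt=1
  shift 3
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    sleep "$(backoff_delay "$attempt" "$base" "$cap")"
    attempt=$((attempt + 1))
  done
}
```

For instance, `retry_with_backoff 5 1 30 curl -fsS http://localhost:8080/.well-known/engine-status` would wait 1, 2, 4, and 8 seconds between its five attempts.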

Deployment & Rollback

Deploy New Version

# Build and push new image
docker build -t ghcr.io/egintegrations/bot-operator:v1.2.0 .
docker push ghcr.io/egintegrations/bot-operator:v1.2.0

# Update Helm values
helm upgrade bot-operator ./chart \
  --set image.tag=v1.2.0 \
  --namespace {namespace}

# Monitor rollout
kubectl rollout status deployment/bot-operator -n {namespace}

# Post in Slack #alerts-info
# "Bot operator updated to v1.2.0"

Rollback

# Rollback to previous version
helm rollback bot-operator -n {namespace}

# Verify rollback
kubectl rollout status deployment/bot-operator -n {namespace}

# Post in Slack #alerts-critical
# "Bot operator rolled back due to [reason]"
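
The deploy and rollback steps can be chained so that a rollout that does not become ready within a time limit triggers the rollback automatically. This is a sketch under assumptions, not an established procedure: the function name, the 300-second timeout, and the `HELM`/`KUBECTL` override variables (useful for rehearsing with stub commands) are all hypothetical.

```shell
# Upgrades the release, waits for the rollout, and rolls back if it fails.
# HELM and KUBECTL can be overridden to rehearse with stub commands.
# Usage: deploy_or_rollback <namespace> <image_tag>
deploy_or_rollback() {
  local ns="$1" tag="$2"
  "${HELM:-helm}" upgrade bot-operator ./chart \
    --set image.tag="$tag" \
    --namespace "$ns" || return 1
  if "${KUBECTL:-kubectl}" rollout status deployment/bot-operator \
      -n "$ns" --timeout=300s; then
    echo "deployed $tag"
  else
    "${HELM:-helm}" rollback bot-operator -n "$ns"
    echo "rolled back from $tag"
    return 1
  fi
}
```

The printed line can be reused as the Slack message, and the non-zero exit status lets a wrapper script decide which channel to post to.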

Manual Failover

If bot operator fails completely:

# 1. Stop current deployment
kubectl scale deployment bot-operator --replicas=0 -n {namespace}

# 2. Deploy backup instance
kubectl apply -f k8s/bot-operator-backup.yaml

# 3. Redirect queue traffic
# (Update queue subscription or consumer group)

# 4. Verify new instance processing
kubectl logs -f -n {namespace} deployment/bot-operator-backup

# 5. Post in Slack #alerts-critical
# "Manual failover to backup bot operator instance"
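
Because failover is rarely exercised, a dry-run mode makes it safer to rehearse the sequence above. The sketch below wraps the automatable steps in a hypothetical `failover` function whose `run` helper only prints each command when `DRY_RUN=1`. Redirecting queue traffic is left manual because it depends on the queue type, and the final step prints recent logs with `--tail` in place of following them so the function can return.

```shell
# Runs a command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Performs the scriptable failover steps for one namespace.
# Usage: failover <namespace>   (prefix with DRY_RUN=1 to rehearse)
failover() {
  local ns="$1"
  run kubectl scale deployment bot-operator --replicas=0 -n "$ns" || return 1
  run kubectl apply -f k8s/bot-operator-backup.yaml || return 1
  # Step 3 (redirecting queue traffic) is queue-specific and stays manual.
  run kubectl logs --tail=50 -n "$ns" deployment/bot-operator-backup
}
```

Running `DRY_RUN=1 failover my-namespace` prints the three commands without touching the cluster, which also suits the monthly failover test in the maintenance checklist.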

Monitoring & Alerts

Uptime Kuma Configuration

Create monitors for:

  1. Bot operator health endpoint
  2. Queue depth threshold
  3. Processing rate threshold
  4. Error rate threshold

PostHog Dashboards

Track:

  • Messages processed per hour
  • Average processing time
  • Error rate by error type
  • Queue depth over time
  • Worker utilization

CrowdSec Integration

Monitor for:

  • Suspicious message patterns
  • Excessive failed authentication attempts
  • DDoS-style message floods

Maintenance Checklist

Daily:

  • Check Uptime Kuma status
  • Review queue backlog in PostHog
  • Check error logs for anomalies

Weekly:

  • Review processing rate trends
  • Analyze error patterns
  • Check resource utilization
  • Review CrowdSec security logs

Monthly:

  • Audit the dead letter queue
  • Review and optimize worker configuration
  • Update documentation
  • Test failover procedure

Future Documentation Tasks

Open documentation tasks for this runbook:

  • Document event queue integrations (SQS, Pub/Sub, etc.)
  • Define specific alert thresholds for each queue type
  • Capture detailed redeploy/rollback procedures
  • Add client-specific configuration examples
  • Document message format validation
  • Add performance tuning guidelines
  • Create troubleshooting decision tree
  • Add on-call escalation procedures

Emergency Contacts

Bot Operator Issues:

  • Post in Slack #alerts-critical
  • Tag platform team lead
  • Include: namespace, error logs, queue metrics

Queue Service Issues:

  • AWS SQS: Check AWS console
  • Google Pub/Sub: Check GCP console
  • Post issue details in Slack #alerts-critical

Last Updated: 2026-03-25
Version: 1.0
Status: Draft - Update before production deployment