Bot Operator Operations Runbook

Historical Runbook

This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.

Status: In Development
Owner: Platform Team

This runbook covers operational procedures for bot operator services that handle background task processing and event queue management.

Quick Links

Service manifest: engines/client-sample.bot-operator.yaml
ApplicationSet: k8s/argocd/applicationset-bot-operator.yaml
Source: apps/bot-operator/
Uptime Kuma: Bot operator service monitors
Slack Alerts: #alerts-warning, #alerts-critical

Common Operations

Check Bot Operator Status

# Check pod status
kubectl get pods -n `{namespace}` -l app=bot-operator

# Check logs
kubectl logs -n `{namespace}` deployment/bot-operator --tail=100

# Check health endpoint
# (port-forward stays in the foreground; run curl from a second terminal)
kubectl port-forward -n `{namespace}` svc/bot-operator 8080:80
curl http://localhost:8080/.well-known/engine-status
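The pod check above can be reduced to a single number for quick triage. The following is a minimal sketch, not part of the original tooling: the function name count_unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --no-headers (NAME READY STATUS RESTARTS AGE).

```shell
# Count pods that are not fully ready, reading the output of
# `kubectl get pods --no-headers` on stdin.
# Assumption: default columns NAME READY STATUS RESTARTS AGE.
count_unhealthy_pods() (
  awk '
    {
      split($2, ready, "/")   # READY column looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { print bad + 0 }     # prints 0 when there is no input
  '
)

# Example invocation (namespace is a placeholder):
#   kubectl get pods -n "$NAMESPACE" -l app=bot-operator --no-headers | count_unhealthy_pods
```

A result of 0 means every listed pod is Running with all containers ready. Any other value is a cue to look at the logs.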

Monitor Queue Backlog

Event Queue Integrations:

AWS SQS
Google Pub/Sub
Redis Queue
RabbitMQ

# Check queue depth (example for SQS)
aws sqs get-queue-attributes \
  --queue-url `{queue_url}` \
  --attribute-names ApproximateNumberOfMessages

# Check processing rate via PostHog
# Dashboard: Bot Operator → Queue Processing Metrics
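For scripting, the queue depth can be pulled out of the JSON that the SQS command above prints. This is a sketch under the assumption that the output has the usual shape, with the count as a quoted string under Attributes. The helper name queue_depth_from_json is hypothetical.

```shell
# Extract ApproximateNumberOfMessages from the JSON printed by
# `aws sqs get-queue-attributes`, read from stdin.
# Assumption: the value appears as a quoted integer, e.g. "42".
queue_depth_from_json() (
  sed -n 's/.*"ApproximateNumberOfMessages"[[:space:]]*:[[:space:]]*"\([0-9][0-9]*\)".*/\1/p' |
    head -n 1
)

# Example invocation (QUEUE_URL is a placeholder):
#   aws sqs get-queue-attributes --queue-url "$QUEUE_URL" \
#     --attribute-names ApproximateNumberOfMessages | queue_depth_from_json
```

The AWS CLI can also return the bare value directly with its global --query 'Attributes.ApproximateNumberOfMessages' --output text options, which avoids text parsing altogether.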

Alert Thresholds

Configure Uptime Kuma monitors:

Queue Backlog Alert (Warning)
- Threshold: >1,000 messages
- Duration: >10 minutes
- Action: Post to Slack #alerts-warning

Queue Backlog Critical (Critical)
- Threshold: >10,000 messages
- Duration: >5 minutes
- Action: Post to Slack #alerts-critical

Processing Error Rate (Warning)
- Threshold: >5% error rate
- Duration: >5 minutes
- Action: Post to Slack #alerts-warning

Bot Operator Down (Critical)
- Threshold: Health check fails
- Duration: >2 minutes
- Action: Post to Slack #alerts-critical
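The two queue backlog thresholds above can be expressed as a small decision function, which is handy when checking a reading by hand or in a cron job. This is an illustrative sketch only. The function name classify_backlog is hypothetical, and it assumes you already know how long the depth has been above the threshold.

```shell
# Map a queue depth and the number of minutes it has persisted to a
# severity, using the backlog thresholds listed above:
#   critical: >10000 messages for >5 minutes
#   warning:  >1000 messages for >10 minutes
classify_backlog() (
  depth=$1
  minutes=$2
  if [ "$depth" -gt 10000 ] && [ "$minutes" -gt 5 ]; then
    echo critical
  elif [ "$depth" -gt 1000 ] && [ "$minutes" -gt 10 ]; then
    echo warning
  else
    echo ok
  fi
)

# Example: classify_backlog 1500 11
```

The critical check runs first so that a very deep queue is never reported as a mere warning.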

Troubleshooting

Queue Backlog Growing

Symptoms:

Messages piling up in queue
Slow processing rate
PostHog showing decreased throughput

Investigation:

# Check CPU/memory usage
kubectl top pod -n `{namespace}` -l app=bot-operator

# Check logs for errors
kubectl logs -n `{namespace}` deployment/bot-operator --tail=200

# Check worker count (desired replicas)
kubectl get deployment bot-operator -n `{namespace}` -o jsonpath='{.spec.replicas}'

Fixes:

Scale Up Workers

kubectl scale deployment bot-operator --replicas=5 -n `{namespace}`
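The replica count of 5 above is an example value. A rough way to pick a number is to divide the backlog by what one replica can clear in the time you are willing to wait. The sketch below assumes you have measured per-replica throughput yourself (for instance from the PostHog dashboard). The function name and all three inputs are hypothetical, and it ignores new messages arriving while the backlog drains.

```shell
# Estimate replicas needed to drain a backlog within a target window.
#   $1 backlog in messages
#   $2 per-replica throughput in messages per minute (measured, assumed constant)
#   $3 target drain time in minutes
replicas_to_drain() (
  backlog=$1
  per_replica_rate=$2
  target_minutes=$3
  capacity=$((per_replica_rate * target_minutes))  # messages one replica clears in the window
  echo $(( (backlog + capacity - 1) / capacity ))  # ceiling division
)

# Example: replicas_to_drain 12000 100 30
```

Treat the result as a lower bound and add headroom for incoming traffic before passing it to kubectl scale.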

Increase Resources

# Edit Helm values
resources:
  limits:
    cpu: 2000m
    memory: 4Gi

Purge Failed Messages

# Permanently delete all messages in the dead letter queue.
# This cannot be undone; inspect or export the messages first.
aws sqs purge-queue --queue-url `{dlq_url}`

High Error Rate

Investigation:

# Check error logs
kubectl logs -n `{namespace}` deployment/bot-operator | grep ERROR

# Check PostHog error tracking
# Dashboard: Bot Operator → Error Analysis
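When PostHog is unavailable, the share of ERROR lines in recent logs gives a crude stand-in for the error rate. This is only a sketch: the function name error_rate_percent is hypothetical, and it assumes that log lines roughly track processed messages, which may not hold for this service.

```shell
# Percentage of stdin lines that contain ERROR (integer, rounded down).
# Assumption: one log line per processed message is a fair approximation.
error_rate_percent() (
  awk '
    /ERROR/ { errors++ }
    END {
      if (NR == 0) print 0
      else print int(errors * 100 / NR)
    }
  '
)

# Example invocation (namespace is a placeholder):
#   kubectl logs -n "$NAMESPACE" deployment/bot-operator --tail=1000 | error_rate_percent
```

A value above 5 corresponds to the Processing Error Rate warning threshold, subject to the caveat above.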

Common Causes:

External API timeouts
Database connection issues
Invalid message format
Rate limiting

Fixes:

Retry failed messages
Increase API timeout settings
Check database connectivity
Implement exponential backoff
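Exponential backoff, the last fix in the list above, means doubling the wait after each failed attempt up to a cap. The sketch below shows the idea for an operator-side retry of any command. The function names, the 1 second base, and the 60 second cap are illustrative assumptions, not settings taken from the bot operator itself.

```shell
# Delay in seconds before retry number $1 (0-based): base * 2^attempt, capped.
#   $2 base delay in seconds (default 1), $3 cap in seconds (default 60)
backoff_delay() (
  attempt=$1
  base=${2:-1}
  cap=${3:-60}
  delay=$base
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  [ "$delay" -gt "$cap" ] && delay=$cap
  echo "$delay"
)

# Run a command up to $1 times, sleeping backoff_delay seconds between failures.
# Exits non-zero if every attempt fails.
retry_with_backoff() (
  max_attempts=$1
  shift
  attempt=0
  while ! "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      exit 1
    fi
    sleep "$(backoff_delay "$((attempt - 1))")"
  done
)

# Example: retry_with_backoff 5 curl -fsS http://localhost:8080/.well-known/engine-status
```

With the defaults, the waits are 1, 2, 4, 8 seconds and so on, never exceeding 60 seconds. Production clients usually add random jitter as well, which is omitted here for brevity.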

Deployment & Rollback

Deploy New Version

# Build and push new image
docker build -t ghcr.io/egintegrations/bot-operator:v1.2.0 .
docker push ghcr.io/egintegrations/bot-operator:v1.2.0

# Upgrade the Helm release to the new image tag
helm upgrade bot-operator ./chart \
  --set image.tag=v1.2.0 \
  --namespace `{namespace}`

# Monitor rollout
kubectl rollout status deployment/bot-operator -n `{namespace}`

# Post in Slack #alerts-info
# "Bot operator updated to v1.2.0"

Rollback

# Roll back to the previous release revision
helm rollback bot-operator -n `{namespace}`

# Verify rollback
kubectl rollout status deployment/bot-operator -n `{namespace}`

# Post in Slack #alerts-critical
# "Bot operator rolled back due to [reason]"

Manual Failover

If bot operator fails completely:

# 1. Stop current deployment
kubectl scale deployment bot-operator --replicas=0 -n `{namespace}`

# 2. Deploy backup instance
kubectl apply -f k8s/bot-operator-backup.yaml

# 3. Redirect queue traffic
# (Update queue subscription or consumer group)

# 4. Verify new instance processing
kubectl logs -f -n `{namespace}` deployment/bot-operator-backup

# 5. Post in Slack #alerts-critical
# "Manual failover to backup bot operator instance"

Monitoring & Alerts

Uptime Kuma Configuration

Create monitors for:

Bot operator health endpoint
Queue depth threshold
Processing rate threshold
Error rate threshold

PostHog Dashboards

Track:

Messages processed per hour
Average processing time
Error rate by error type
Queue depth over time
Worker utilization

CrowdSec Integration

Monitor for:

Suspicious message patterns
Excessive failed authentication attempts
DDoS-style message floods

Maintenance Checklist

Daily:

Check Uptime Kuma status
Review queue backlog in PostHog
Check error logs for anomalies

Weekly:

Review processing rate trends
Analyze error patterns
Check resource utilization
Review CrowdSec security logs

Monthly:

Audit the dead letter queue
Review and optimize worker configuration
Update documentation
Test failover procedure

Future Documentation Tasks

Planned additions to this runbook:

Document event queue integrations (SQS, Pub/Sub, etc.)
Define specific alert thresholds for each queue type
Capture detailed redeploy/rollback procedures
Add client-specific configuration examples
Document message format validation
Add performance tuning guidelines
Create troubleshooting decision tree
Add on-call escalation procedures

Emergency Contacts

Bot Operator Issues:

Post in Slack #alerts-critical
Tag platform team lead
Include: namespace, error logs, queue metrics

Queue Service Issues:

AWS SQS: Check AWS console
Google Pub/Sub: Check GCP console
Post issue details in Slack #alerts-critical

Last Updated: 2026-03-25
Version: 1.0
Status: Draft - Update before production deployment
