Bot Operator Operations Runbook
This runbook reflects the retired engine-era operating model and is preserved for reference only. It is not part of the current golden path.
Status: In Development
Owner: Platform Team
This runbook covers operational procedures for bot operator services that handle background task processing and event queue management.
Quick Links
- Service manifest: engines/client-sample.bot-operator.yaml
- ApplicationSet: k8s/argocd/applicationset-bot-operator.yaml
- Source: apps/bot-operator/
- Uptime Kuma: Bot operator service monitors
- Slack Alerts: #alerts-warning, #alerts-critical
Common Operations
Check Bot Operator Status
# Check pod status
kubectl get pods -n {namespace} -l app=bot-operator
# Check logs
kubectl logs -n {namespace} deployment/bot-operator --tail=100
# Check health endpoint (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n {namespace} svc/bot-operator 8080:80
curl http://localhost:8080/.well-known/engine-status
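The health check can be wrapped for use in scripts. This is a minimal sketch, assuming the port-forward is already running and that the endpoint answers with an HTTP error status (4xx/5xx) when the service is unhealthy. The function name check_health and the 5-second timeout are illustrative choices, not part of the service.

```shell
# Hypothetical helper: prints "healthy" or "unhealthy" for the status endpoint.
# Assumes the port-forward from the step above is active on localhost:8080.
check_health() {
  local url="${1:-http://localhost:8080/.well-known/engine-status}"
  # -f: fail on HTTP >= 400, -sS: quiet but still print errors,
  # --max-time: give up after 5 seconds (assumed value)
  if curl -fsS --max-time 5 "$url" >/dev/null; then
    echo healthy
  else
    echo unhealthy
    return 1
  fi
}
```

Calling check_health with no argument probes the default URL; a non-zero exit status lets it gate later steps in a script.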
Monitor Queue Backlog
Event Queue Integrations:
- AWS SQS
- Google Pub/Sub
- Redis Queue
- RabbitMQ
# Check queue depth (example for SQS)
aws sqs get-queue-attributes \
  --queue-url {queue_url} \
  --attribute-names ApproximateNumberOfMessages
# Check processing rate via PostHog
# Dashboard: Bot Operator → Queue Processing Metrics
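For scripting, the SQS depth check can be reduced to a single number. This sketch uses the AWS CLI's global --query and --output options to extract the attribute; the function name queue_depth is hypothetical, and the other queue backends listed above would need their own equivalent.

```shell
# Hypothetical helper: print the approximate number of visible messages
# in the SQS queue whose URL is passed as the first argument.
queue_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}
```

SQS reports this value as an approximation, so treat small differences between consecutive readings as noise.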
Alert Thresholds
Configure Uptime Kuma monitors:
- Queue Backlog Alert (Warning)
  - Threshold: >1,000 messages
  - Duration: >10 minutes
  - Action: Post to Slack #alerts-warning
- Queue Backlog Critical (Critical)
  - Threshold: >10,000 messages
  - Duration: >5 minutes
  - Action: Post to Slack #alerts-critical
- Processing Error Rate (Warning)
  - Threshold: >5% error rate
  - Duration: >5 minutes
  - Action: Post to Slack #alerts-warning
- Bot Operator Down (Critical)
  - Threshold: Health check fails
  - Duration: >2 minutes
  - Action: Post to Slack #alerts-critical
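The backlog and error-rate thresholds above can be expressed as simple checks, for example in a cron job that feeds a push-style monitor. This is a sketch only: the function and variable names are hypothetical, and it evaluates a single reading, so the duration conditions (sustained for 5 or 10 minutes) would still have to be enforced by the monitor itself.

```shell
# Thresholds copied from the alert definitions in this section.
WARN_THRESHOLD=1000
CRIT_THRESHOLD=10000
ERROR_RATE_PCT=5

# Print ok, warning, or critical for a single queue depth reading.
classify_backlog() {
  local depth="$1"
  if [ "$depth" -gt "$CRIT_THRESHOLD" ]; then
    echo critical
  elif [ "$depth" -gt "$WARN_THRESHOLD" ]; then
    echo warning
  else
    echo ok
  fi
}

# Succeed (exit 0) when errors/total is strictly above ERROR_RATE_PCT percent.
# Integer arithmetic avoids floating point: errors*100 > total*pct.
error_rate_exceeded() {
  local errors="$1" total="$2"
  [ $((errors * 100)) -gt $((total * ERROR_RATE_PCT)) ]
}
```

For example, classify_backlog 1500 prints warning, and error_rate_exceeded 6 100 succeeds because 6% is above the 5% threshold.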
Troubleshooting
Queue Backlog Growing
Symptoms:
- Messages piling up in queue
- Slow processing rate
- PostHog showing decreased throughput
Investigation:
# Check CPU/memory usage
kubectl top pod -n {namespace} -l app=bot-operator
# Check logs for errors
kubectl logs -n {namespace} deployment/bot-operator --tail=200
# Check worker count
kubectl get deployment bot-operator -n {namespace} -o yaml | grep replicas
Fixes:
- Scale Up Workers
  kubectl scale deployment bot-operator --replicas=5 -n {namespace}
- Increase Resources
  # Edit Helm values
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
- Purge Failed Messages
  # Failed messages land in the dead letter queue; purging it deletes them permanently
  aws sqs purge-queue --queue-url {dlq_url}
High Error Rate
Investigation:
# Check error logs
kubectl logs -n {namespace} deployment/bot-operator | grep ERROR
# Check PostHog error tracking
# Dashboard: Bot Operator → Error Analysis
Common Causes:
- External API timeouts
- Database connection issues
- Invalid message format
- Rate limiting
Fixes:
- Retry failed messages
- Increase API timeout settings
- Check database connectivity
- Implement exponential backoff
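The retry and exponential backoff fixes can be sketched as a shell wrapper around any failing command. This is an illustration under stated assumptions, not the bot operator's own retry logic: the function names, the 1-second base delay, and the 60-second cap are all hypothetical, and no jitter is applied.

```shell
# Print the delay in seconds before retry number $1 (0-based),
# doubling from base delay $2 and never exceeding cap $3.
backoff_delay() {
  local attempt="$1" delay="$2" cap="$3" i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# Run a command up to $1 times, sleeping with exponential backoff
# between failed attempts. Returns 1 if every attempt fails.
retry_with_backoff() {
  local max_attempts="$1" n=0
  shift
  while ! "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$(backoff_delay "$((n - 1))" 1 60)"
  done
}
```

A call such as retry_with_backoff 5 curl -fsS {api_url} would wait 1, 2, 4, and 8 seconds between its five attempts. When many workers retry against the same dependency, adding random jitter to each delay helps avoid synchronized retries.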
Deployment & Rollback
Deploy New Version
# Build and push new image
docker build -t ghcr.io/egintegrations/bot-operator:v1.2.0 .
docker push ghcr.io/egintegrations/bot-operator:v1.2.0
# Update Helm values
helm upgrade bot-operator ./chart \
  --set image.tag=v1.2.0 \
  --namespace {namespace}
# Monitor rollout
kubectl rollout status deployment/bot-operator -n {namespace}
# Post in Slack #alerts-info
# "Bot operator updated to v1.2.0"
Rollback
# Rollback to previous version
helm rollback bot-operator -n {namespace}
# Verify rollback
kubectl rollout status deployment/bot-operator -n {namespace}
# Post in Slack #alerts-critical
# "Bot operator rolled back due to [reason]"
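The deploy and rollback steps can be chained so that a failed rollout triggers the rollback automatically. A minimal sketch, assuming the release name and chart path from the commands above; the function name deploy_with_rollback and the 300-second rollout timeout are hypothetical, and the Slack notifications are still posted by hand.

```shell
# Hypothetical helper: upgrade to image tag $2 in namespace $1,
# and roll back to the previous Helm revision if the rollout fails.
deploy_with_rollback() {
  local namespace="$1" tag="$2"
  helm upgrade bot-operator ./chart \
    --set "image.tag=$tag" \
    --namespace "$namespace" || return 1
  # --timeout bounds the wait; 300s is an assumed value
  if ! kubectl rollout status deployment/bot-operator \
      -n "$namespace" --timeout=300s; then
    echo "Rollout failed; rolling back" >&2
    helm rollback bot-operator -n "$namespace"
    return 1
  fi
}
```

The function returns non-zero whenever the new version did not roll out cleanly, so a calling script can decide which Slack channel to post to.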
Manual Failover
If the bot operator fails completely:
# 1. Stop current deployment
kubectl scale deployment bot-operator --replicas=0 -n {namespace}
# 2. Deploy backup instance
kubectl apply -f k8s/bot-operator-backup.yaml
# 3. Redirect queue traffic
# (Update queue subscription or consumer group)
# 4. Verify new instance processing
kubectl logs -f -n {namespace} deployment/bot-operator-backup
# 5. Post in Slack #alerts-critical
# "Manual failover to backup bot operator instance"
Monitoring & Alerts
Uptime Kuma Configuration
Create monitors for:
- Bot operator health endpoint
- Queue depth threshold
- Processing rate threshold
- Error rate threshold
PostHog Dashboards
Track:
- Messages processed per hour
- Average processing time
- Error rate by error type
- Queue depth over time
- Worker utilization
CrowdSec Integration
Monitor for:
- Suspicious message patterns
- Excessive failed authentication attempts
- DDoS-style message floods
Maintenance Checklist
Daily:
- Check Uptime Kuma status
- Review queue backlog in PostHog
- Check error logs for anomalies
Weekly:
- Review processing rate trends
- Analyze error patterns
- Check resource utilization
- Review CrowdSec security logs
Monthly:
- Audit the dead letter queue
- Review and optimize worker configuration
- Update documentation
- Test failover procedure
Future Documentation Tasks
Planned work for this runbook:
- Document event queue integrations (SQS, Pub/Sub, etc.)
- Define specific alert thresholds for each queue type
- Capture detailed redeploy/rollback procedures
- Add client-specific configuration examples
- Document message format validation
- Add performance tuning guidelines
- Create troubleshooting decision tree
- Add on-call escalation procedures
Emergency Contacts
Bot Operator Issues:
- Post in Slack #alerts-critical
- Tag platform team lead
- Include: namespace, error logs, queue metrics
Queue Service Issues:
- AWS SQS: Check AWS console
- Google Pub/Sub: Check GCP console
- Post issue details in Slack #alerts-critical
Last Updated: 2026-03-25
Version: 1.0
Status: Draft - Update before production deployment