Runbook: Unhealthy Service Response
Use this runbook for a production service that is up but degraded, intermittently failing, or failing health checks.
Immediate Checks
- Confirm the affected production endpoint and time window.
- Check the documented health endpoint.
- Review recent alerts, logs, and key metrics.
- Confirm whether the issue is isolated or system-wide.
Common Failure Areas
- recent deployments or configuration changes
- dependency outage
- expired or missing credentials
- resource exhaustion
- queue backlog or job failure
- edge or DNS misrouting
Standard Recovery Flow
- Stabilize user impact if possible.
- Compare current health and metrics with the last known healthy state.
- Determine whether rollback is safer than in-place intervention.
- If restart is safe and documented, perform the smallest viable restart action.
- Validate health, alerts, and error rate after recovery.
- Escalate to the development company if code-level remediation is required.
Required References
- system runbook
- service dashboard
- log location
- alert route
- rollback procedure
- ownership and escalation map