Skip to main content

Runbook: Unhealthy Service Response

Use this runbook for a production service that is up but degraded, intermittently failing, or failing health checks.

Immediate Checks

Confirm the affected production endpoint and time window.
Check the documented health endpoint.
Review recent alerts, logs, and key metrics.
Confirm whether the issue is isolated or system-wide.

Common Failure Areas

recent deployments or configuration changes
dependency outage
expired or missing credentials
resource exhaustion
queue backlog or job failure
edge or DNS misrouting

Standard Recovery Flow

Stabilize user impact if possible.
Compare current health and metrics with the last known healthy state.
Determine whether rollback is safer than in-place intervention.
If restart is safe and documented, perform the smallest viable restart action.
Validate health, alerts, and error rate after recovery.
Escalate to the development company if code-level remediation is required.

Required References

system runbook
service dashboard
log location
alert route
rollback procedure
ownership and escalation map

Immediate Checks
Common Failure Areas
Standard Recovery Flow
Required References