Skip to main content

Runbook: Unhealthy Service Response

Use this runbook for a production service that is up but degraded, intermittently failing, or failing health checks.

Immediate Checks

  1. Confirm the affected production endpoint and time window.
  2. Check the documented health endpoint.
  3. Review recent alerts, logs, and key metrics.
  4. Confirm whether the issue is isolated or system-wide.

Common Failure Areas

  • recent deployments or configuration changes
  • dependency outage
  • expired or missing credentials
  • resource exhaustion
  • queue backlog or job failure
  • edge or DNS misrouting

Standard Recovery Flow

  1. Stabilize user impact if possible.
  2. Compare current health and metrics with the last known healthy state.
  3. Determine whether rollback is safer than in-place intervention.
  4. If restart is safe and documented, perform the smallest viable restart action.
  5. Validate health, alerts, and error rate after recovery.
  6. Escalate to the development company if code-level remediation is required.

Required References

  • system runbook
  • service dashboard
  • log location
  • alert route
  • rollback procedure
  • ownership and escalation map