
Incident Report: [Service] - [Date]

Template Instructions

Replace all text in [brackets] with your content. Delete this callout when done. This report should be blameless - focus on systems and processes, not individuals.

Executive Summary

[2-3 sentence summary of what happened, the impact, and the resolution. This should be understandable by non-technical stakeholders.]

Incident Details

Field | Value
Incident ID | [INC-YYYY-MM-DD-XXX]
Service Affected | [Service name]
Date/Time | [YYYY-MM-DD HH:MM UTC]
Duration | [Total downtime/degradation]
Severity | P0 / P1 / P2 / P3 / P4
Incident Commander | [Name]
Status | 🔴 Ongoing / 🟡 Mitigated / 🟢 Resolved

Severity Definitions

  • P0: Complete outage, business-critical impact
  • P1: Partial outage, major feature unavailable
  • P2: Performance degradation, workaround available
  • P3: Minor issue, minimal impact
  • P4: Cosmetic issue, no functional impact

Timeline

Tip

Use UTC times for all timestamps. Include timezone conversions if relevant to stakeholders.
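As a rough illustration of the tip above, the sketch below renders one UTC timestamp alongside stakeholder-local conversions. The function name `with_conversions`, the sample date, and the zone labels and fixed offsets are all assumptions made for the example; they are not part of the template.

```python
from datetime import datetime, timedelta, timezone


def with_conversions(utc_stamp: str, zones: dict) -> str:
    """Format a 'YYYY-MM-DD HH:MM' UTC timestamp with local-time conversions.

    `zones` maps a display label to a fixed UTC offset in hours (hypothetical
    values; fixed offsets ignore daylight saving time).
    """
    utc_time = datetime.strptime(utc_stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    parts = [f"{utc_time:%H:%M} UTC"]
    for label, offset_hours in zones.items():
        local = utc_time.astimezone(timezone(timedelta(hours=offset_hours)))
        parts.append(f"{local:%H:%M} {label}")
    return " / ".join(parts)


# Sample values for illustration only.
print(with_conversions("2024-01-15 14:30", {"EST": -5, "IST": 5.5}))
# prints: 14:30 UTC / 09:30 EST / 20:00 IST
```

Fixed offsets keep the sketch dependency-free. If daylight saving time matters for your stakeholders, Python's standard `zoneinfo` module (3.9+) is one option for named time zones.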

Detection

[HH:MM UTC] - First Alert

  • [Source of alert: monitoring, user report, etc.]
  • [What was detected]

[HH:MM UTC] - Incident Confirmed

  • [Who confirmed]
  • [Initial assessment]

Investigation

[HH:MM UTC] - Investigation Began

  • [Initial actions taken]
  • [Hypotheses explored]

[HH:MM UTC] - [Key Finding 1]

  • [What was discovered]
  • [Actions taken]

[HH:MM UTC] - [Key Finding 2]

  • [What was discovered]
  • [Actions taken]

[HH:MM UTC] - Root Cause Identified

  • [What was determined to be the root cause]

Resolution

[HH:MM UTC] - Mitigation Started

  • [What mitigation was applied]
  • [Who applied it]

[HH:MM UTC] - Service Restored

  • [What action restored service]
  • [Verification performed]

[HH:MM UTC] - Incident Declared Resolved

  • [Final verification]
  • [All-clear given by]

Post-Incident

[HH:MM UTC] - Post-Incident Review Scheduled

  • [Date/time of review]
  • [Attendees]
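The Duration field and the timeline entries above can be cross-checked by computing intervals directly from the recorded timestamps. The following is a minimal sketch; the dictionary keys, the metric names, and the sample times are hypothetical and should be adapted to whichever timeline entries your report actually records.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # matches the template's YYYY-MM-DD HH:MM (UTC) format


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two UTC timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def summarize(timeline: dict) -> dict:
    """Derive response intervals, all measured from the first alert."""
    start = timeline["first_alert"]
    return {
        "time_to_confirm_min": minutes_between(start, timeline["confirmed"]),
        "time_to_mitigate_min": minutes_between(start, timeline["mitigation_started"]),
        "time_to_restore_min": minutes_between(start, timeline["service_restored"]),
        "total_duration_min": minutes_between(start, timeline["resolved"]),
    }


# Sample timeline for illustration only.
timeline = {
    "first_alert": "2024-01-15 14:30",
    "confirmed": "2024-01-15 14:38",
    "mitigation_started": "2024-01-15 15:05",
    "service_restored": "2024-01-15 15:20",
    "resolved": "2024-01-15 16:00",
}
print(summarize(timeline))
# {'time_to_confirm_min': 8, 'time_to_mitigate_min': 35,
#  'time_to_restore_min': 50, 'total_duration_min': 90}
```

Whether "Duration" means time to service restoration or time to declared resolution is a team convention; state which one the report uses.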

Impact Assessment

User Impact

Users Affected: [Number or percentage of users]

Geographic Impact: [All regions, specific regions, etc.]

User-Facing Impact:

  • [What users experienced]
  • [What functionality was unavailable]
  • [What errors users saw]

Business Impact:

  • Lost transactions: [Number/value if quantifiable]
  • Support tickets: [Number created]
  • Customer complaints: [Number/severity]
  • Revenue impact: [If applicable and calculable]

System Impact

Services Affected:

  • [Service 1] - [Impact level]
  • [Service 2] - [Impact level]
  • [Service 3] - [Impact level]

Data Impact:

  • Data loss: ✅ None / ⚠️ Partial / ❌ Significant
  • Data integrity: ✅ Maintained / ⚠️ Questionable / ❌ Compromised
  • Backup status: ✅ Current / ⚠️ Stale / ❌ Failed

Metrics

Metric | Normal | During Incident | Peak Impact
Uptime | 99.9% | [%] | [%]
Response Time | [Xms] | [Xms] | [Xms]
Error Rate | [%] | [%] | [%]
Request Volume | [N/min] | [N/min] | [N/min]
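If the values for the metrics table come from raw request counts rather than a dashboard, they can be derived with a few lines of arithmetic. This is only a sketch: the function names and the sample counts are assumptions, and your monitoring system may define these metrics differently.

```python
def error_rate_pct(failed_requests: int, total_requests: int) -> float:
    """Error rate as a percentage of all requests in the window."""
    if total_requests == 0:
        return 0.0  # no traffic: report 0% rather than divide by zero
    return round(100 * failed_requests / total_requests, 2)


def requests_per_min(total_requests: int, window_minutes: float) -> float:
    """Average request volume per minute over the window."""
    return round(total_requests / window_minutes, 1)


# Sample counts for illustration only: a 60-minute incident window.
print(error_rate_pct(450, 12000))   # 3.75
print(requests_per_min(12000, 60))  # 200.0
```

Using the same window length for the "Normal" and "During Incident" columns keeps the comparison meaningful.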

Root Cause Analysis

Contributing Factors

Immediate Cause: [What directly caused the incident? The proximate trigger.]

Root Cause: [What systemic issue allowed the immediate cause to result in an incident? Why did our defenses fail?]

Contributing Factors:

  1. [Factor 1] - [How it contributed]
  2. [Factor 2] - [How it contributed]
  3. [Factor 3] - [How it contributed]

What Went Wrong

Technical Issues:

  • [Issue 1]
  • [Issue 2]

Process Issues:

  • [Gap in process 1]
  • [Gap in process 2]

Monitoring/Alerting Issues:

  • [What we didn't detect]
  • [Alerting delays]

What Went Well

Blameless Culture

Always include what went well. Recognize good decisions and effective responses.

Effective Responses:

  • [Good decision 1]
  • [Good decision 2]

Working Systems:

  • [System that worked as expected]
  • [Process that worked well]

Resolution Details

Immediate Mitigation

Actions Taken:

  1. [Action 1]
    # Commands executed if applicable
    [command]
  2. [Action 2]
  3. [Action 3]

Why This Worked: [Explanation of why the mitigation was effective]

Permanent Fix

Status: 🔴 Not Started / 🟡 In Progress / 🟢 Complete

Implementation:

  1. [Permanent fix step 1]
  2. [Permanent fix step 2]
  3. [Permanent fix step 3]

Timeline: [When permanent fix will be deployed]

Verification: [How we'll verify the fix works]

Action Items

Immediate Actions (Complete within 48 hours)

  • [Action 1] - Owner: [Name] - Due: [Date]
  • [Action 2] - Owner: [Name] - Due: [Date]
  • [Action 3] - Owner: [Name] - Due: [Date]

Short-term Actions (Complete within 2 weeks)

  • [Action 1] - Owner: [Name] - Due: [Date]
  • [Action 2] - Owner: [Name] - Due: [Date]
  • [Action 3] - Owner: [Name] - Due: [Date]

Long-term Actions (Complete within 3 months)

  • [Action 1] - Owner: [Name] - Due: [Date]
  • [Action 2] - Owner: [Name] - Due: [Date]
  • [Action 3] - Owner: [Name] - Due: [Date]

Process Improvements

  • [Update SOP for X] - Owner: [Name] - Due: [Date]
  • [Add monitoring for Y] - Owner: [Name] - Due: [Date]
  • [Document runbook for Z] - Owner: [Name] - Due: [Date]

Preventive Measures

To Prevent Recurrence:

  • [Technical improvement 1]
  • [Process improvement 1]
  • [Monitoring improvement 1]

To Detect Earlier:

  • [Alert to add]
  • [Monitoring to enhance]
  • [Dashboard to create]

To Mitigate Faster:

  • [Runbook to create]
  • [Automation to build]
  • [Training to provide]

Communication

Internal Communication

Channels Used:

  • [Slack channel]
  • [Email list]
  • [Incident management tool]

Communication Timeline:

  • [HH:MM UTC] - Internal alert sent
  • [HH:MM UTC] - Status update 1
  • [HH:MM UTC] - Status update 2
  • [HH:MM UTC] - Resolution communicated

Effectiveness:

  • What worked: [Effective communication]
  • What didn't: [Communication gaps]

External Communication

Status Page Updates:

  • [HH:MM UTC] - Investigating
  • [HH:MM UTC] - Identified
  • [HH:MM UTC] - Monitoring
  • [HH:MM UTC] - Resolved

Customer Communication:

  • Email sent: [Yes/No] - [When]
  • Support notified: [Yes/No] - [When]
  • Social media: [Any posts]

Stakeholder Communication:

  • [Stakeholder 1]: [When/how notified]
  • [Stakeholder 2]: [When/how notified]

Lessons Learned

What We Learned

  1. [Lesson 1]

    • [Details]
    • [How we'll apply this learning]
  2. [Lesson 2]

    • [Details]
    • [How we'll apply this learning]
  3. [Lesson 3]

    • [Details]
    • [How we'll apply this learning]

Knowledge Gaps Identified

  • [Gap 1] - [How we'll address it]
  • [Gap 2] - [How we'll address it]

Documentation Updates Needed

  • Update system profile: [What to add]
  • Create runbook: [For what scenario]
  • Update SOP: [What changes]
  • Document tribal knowledge: [What to capture]

Technical Details

Optional Section

Include technical details here if they are useful to the engineering team. This section can be omitted from the executive-summary version of the report.

System Architecture Relevant to Incident

[Diagram or description of the system components involved]

Logs and Evidence

Key Log Entries:

[Relevant log entries that show the problem]

Monitoring Graphs:

  • [Link to graph 1]
  • [Link to graph 2]

Related Tickets:

  • [Link to incident ticket]
  • [Link to related bugs]
  • [Link to follow-up work]

Potentially Related Deployments:

  • [Deployment 1]: [Date/time] - [Link to commit]
  • [Deployment 2]: [Date/time] - [Link to commit]

Root Cause Commit:

  • Repository: [Repo name]
  • Commit: [Hash]
  • Date: [When merged]
  • Link: [GitHub link]

Follow-up

Post-Incident Review

Date: [YYYY-MM-DD]

Attendees:

  • [Name, Role]
  • [Name, Role]
  • [Name, Role]

Key Discussion Points:

  • [Point 1]
  • [Point 2]

Decisions Made:

  • [Decision 1]
  • [Decision 2]

Action Item Tracking

Review Cadence: [Weekly until complete]

Review Owner: [Name]

Progress Tracking: [Link to project board / Jira]

SLA Impact

Uptime SLA: [Target: X%, Actual: Y%]

SLA Met: ✅ Yes / ❌ No

SLA Credit: [If applicable]
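The "Target: X%, Actual: Y%" comparison above can be worked out from total downtime in the SLA measurement window. The sketch below assumes a 30-day window, a 99.9% target, and 90 minutes of downtime purely for illustration; your contract's window, target, and definition of downtime may differ.

```python
def sla_report(target_pct: float, period_days: int, downtime_minutes: float) -> dict:
    """Compare actual uptime against an uptime SLA target for one window."""
    period_minutes = period_days * 24 * 60
    actual_pct = 100 * (period_minutes - downtime_minutes) / period_minutes
    allowed_downtime = period_minutes * (1 - target_pct / 100)
    return {
        "actual_pct": round(actual_pct, 3),
        "allowed_downtime_min": round(allowed_downtime, 1),
        "sla_met": actual_pct >= target_pct,
    }


# Assumed values: 99.9% target, 30-day window, 90 minutes of downtime.
print(sla_report(target_pct=99.9, period_days=30, downtime_minutes=90))
# {'actual_pct': 99.792, 'allowed_downtime_min': 43.2, 'sla_met': False}
```

Under these assumed numbers, a 99.9% target leaves about 43 minutes of downtime per 30 days, so a 90-minute outage alone breaches it.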

Related Documentation

  • [System Profile: [Service]]
  • [Runbook: [Related Runbook]]
  • [SOP: [Related SOP]]
  • [ADR: [Related Architecture Decision]]
  • [Previous Similar Incidents]

Appendices

Appendix A: Full Command History

# All commands executed during incident response
[timestamp] [command]
[timestamp] [command]

Appendix B: Complete Timeline with All Actions

[More detailed timeline if main timeline was summarized]

Appendix C: Communication Templates Used

[Templates used for status updates]

Sign-off

  • Report Author: [Name] - [Date]
  • Reviewed By: [Name, Role] - [Date]
  • Approved By: [Name, Role] - [Date]
  • Next Review Date: [Date] - [For action item progress check]

Document Version: 1.0
Last Updated: [YYYY-MM-DD]
Status: Draft / Under Review / Published