Incident Report: [Service] - [Date]
Replace all text in [brackets] with your content. Delete this callout when done. This report should be blameless - focus on systems and processes, not individuals.
Executive Summary
[2-3 sentence summary of what happened, the impact, and the resolution. This should be understandable by non-technical stakeholders.]
Incident Details
| Field | Value |
|---|---|
| Incident ID | [INC-YYYY-MM-DD-XXX] |
| Service Affected | [Service name] |
| Date/Time | [YYYY-MM-DD HH:MM UTC] |
| Duration | [Total downtime/degradation] |
| Severity | P0 / P1 / P2 / P3 / P4 |
| Incident Commander | [Name] |
| Status | 🔴 Ongoing / 🟡 Mitigated / 🟢 Resolved |
Severity Definitions
- P0: Complete outage, business-critical impact
- P1: Partial outage, major feature unavailable
- P2: Performance degradation, workaround available
- P3: Minor issue, minimal impact
- P4: Cosmetic issue, no functional impact
Timeline
Use UTC times for all timestamps. Include timezone conversions if relevant to stakeholders.
Detection
[HH:MM UTC] - First Alert
- [Source of alert: monitoring, user report, etc.]
- [What was detected]
[HH:MM UTC] - Incident Confirmed
- [Who confirmed]
- [Initial assessment]
Investigation
[HH:MM UTC] - Investigation Began
- [Initial actions taken]
- [Hypotheses explored]
[HH:MM UTC] - [Key Finding 1]
- [What was discovered]
- [Actions taken]
[HH:MM UTC] - [Key Finding 2]
- [What was discovered]
- [Actions taken]
[HH:MM UTC] - Root Cause Identified
- [What was determined to be the root cause]
Resolution
[HH:MM UTC] - Mitigation Started
- [What mitigation was applied]
- [Who applied it]
[HH:MM UTC] - Service Restored
- [What action restored service]
- [Verification performed]
[HH:MM UTC] - Incident Declared Resolved
- [Final verification]
- [All-clear given by]
Post-Incident
[HH:MM UTC] - Post-Incident Review Scheduled
- [Date/time of review]
- [Attendees]
Impact Assessment
User Impact
Users Affected: [Number or percentage of users]
Geographic Impact: [All regions, specific regions, etc.]
User-Facing Impact:
- [What users experienced]
- [What functionality was unavailable]
- [What errors users saw]
Business Impact:
- Lost transactions: [Number/value if quantifiable]
- Support tickets: [Number created]
- Customer complaints: [Number/severity]
- Revenue impact: [If applicable and calculable]
System Impact
Services Affected:
- [Service 1] - [Impact level]
- [Service 2] - [Impact level]
- [Service 3] - [Impact level]
Data Impact:
- Data loss: ✅ None / ⚠️ Partial / ❌ Significant
- Data integrity: ✅ Maintained / ⚠️ Questionable / ❌ Compromised
- Backup status: ✅ Current / ⚠️ Stale / ❌ Failed
Metrics
| Metric | Normal | During Incident | Peak Impact |
|---|---|---|---|
| Uptime | 99.9% | [%] | [%] |
| Response Time | [Xms] | [Xms] | [Xms] |
| Error Rate | [%] | [%] | [%] |
| Request Volume | [N/min] | [N/min] | [N/min] |
Root Cause Analysis
Contributing Factors
Immediate Cause: [What directly caused the incident? The proximate trigger.]
Root Cause: [What systemic issue allowed the immediate cause to result in an incident? Why did our defenses fail?]
Contributing Factors:
- [Factor 1] - [How it contributed]
- [Factor 2] - [How it contributed]
- [Factor 3] - [How it contributed]
What Went Wrong
Technical Issues:
- [Issue 1]
- [Issue 2]
Process Issues:
- [Gap in process 1]
- [Gap in process 2]
Monitoring/Alerting Issues:
- [What we didn't detect]
- [Alerting delays]
What Went Well
Always include what went well. Recognize good decisions and effective responses.
Effective Responses:
- [Good decision 1]
- [Good decision 2]
Working Systems:
- [System that worked as expected]
- [Process that worked well]
Resolution Details
Immediate Mitigation
Actions Taken:
- [Action 1]
# Commands executed if applicable
[command] - [Action 2]
- [Action 3]
Why This Worked: [Explanation of why the mitigation was effective]
Permanent Fix
Status: 🔴 Not Started / 🟡 In Progress / 🟢 Complete
Implementation:
- [Permanent fix step 1]
- [Permanent fix step 2]
- [Permanent fix step 3]
Timeline: [When permanent fix will be deployed]
Verification: [How we'll verify the fix works]
Action Items
Immediate Actions (Complete within 48 hours)
- [Action 1] - Owner: [Name] - Due: [Date]
- [Action 2] - Owner: [Name] - Due: [Date]
- [Action 3] - Owner: [Name] - Due: [Date]
Short-term Actions (Complete within 2 weeks)
- [Action 1] - Owner: [Name] - Due: [Date]
- [Action 2] - Owner: [Name] - Due: [Date]
- [Action 3] - Owner: [Name] - Due: [Date]
Long-term Actions (Complete within 3 months)
- [Action 1] - Owner: [Name] - Due: [Date]
- [Action 2] - Owner: [Name] - Due: [Date]
- [Action 3] - Owner: [Name] - Due: [Date]
Process Improvements
- [Update SOP for X] - Owner: [Name] - Due: [Date]
- [Add monitoring for Y] - Owner: [Name] - Due: [Date]
- [Document runbook for Z] - Owner: [Name] - Due: [Date]
Preventive Measures
To Prevent Recurrence:
- [Technical improvement 1]
- [Process improvement 1]
- [Monitoring improvement 1]
To Detect Earlier:
- [Alert to add]
- [Monitoring to enhance]
- [Dashboard to create]
To Mitigate Faster:
- [Runbook to create]
- [Automation to build]
- [Training to provide]
Communication
Internal Communication
Channels Used:
- [Slack channel]
- [Email list]
- [Incident management tool]
Communication Timeline:
- [HH:MM UTC] - Internal alert sent
- [HH:MM UTC] - Status update 1
- [HH:MM UTC] - Status update 2
- [HH:MM UTC] - Resolution communicated
Effectiveness:
- What worked: [Effective communication]
- What didn't: [Communication gaps]
External Communication
Status Page Updates:
- [HH:MM UTC] - Investigating
- [HH:MM UTC] - Identified
- [HH:MM UTC] - Monitoring
- [HH:MM UTC] - Resolved
Customer Communication:
- Email sent: [Yes/No] - [When]
- Support notified: [Yes/No] - [When]
- Social media: [Any posts]
Stakeholder Communication:
- [Stakeholder 1]: [When/how notified]
- [Stakeholder 2]: [When/how notified]
Lessons Learned
What We Learned
-
[Lesson 1]
- [Details]
- [How we'll apply this learning]
-
[Lesson 2]
- [Details]
- [How we'll apply this learning]
-
[Lesson 3]
- [Details]
- [How we'll apply this learning]
Knowledge Gaps Identified
- [Gap 1] - [How we'll address it]
- [Gap 2] - [How we'll address it]
Documentation Updates Needed
- Update system profile: [What to add]
- Create runbook: [For what scenario]
- Update SOP: [What changes]
- Document tribal knowledge: [What to capture]
Technical Details
Include technical details here if useful for engineering team. Can be excluded from executive summary version.
System Architecture Relevant to Incident
[Diagram or description of the system components involved]
Logs and Evidence
Key Log Entries:
[Relevant log entries that show the problem]
Monitoring Graphs:
- [Link to graph 1]
- [Link to graph 2]
Related Tickets:
- [Link to incident ticket]
- [Link to related bugs]
- [Link to follow-up work]
Code Changes Related to Incident
Potentially Related Deployments:
- [Deployment 1]: [Date/time] - [Link to commit]
- [Deployment 2]: [Date/time] - [Link to commit]
Root Cause Commit:
- Repository: [Repo name]
- Commit: [Hash]
- Date: [When merged]
- Link: [GitHub link]
Follow-up
Post-Incident Review
Date: [YYYY-MM-DD]
Attendees:
- [Name, Role]
- [Name, Role]
- [Name, Role]
Key Discussion Points:
- [Point 1]
- [Point 2]
Decisions Made:
- [Decision 1]
- [Decision 2]
Action Item Tracking
Review Cadence: [Weekly until complete]
Review Owner: [Name]
Progress Tracking: [Link to project board / Jira]
SLA Impact
Uptime SLA: [Target: X%, Actual: Y%]
SLA Met: ✅ Yes / ❌ No
SLA Credit: [If applicable]
Related Documentation
- [System Profile: [Service]]
- [Runbook: [Related Runbook]]
- [SOP: [Related SOP]]
- [ADR: [Related Architecture Decision]]
- [Previous Similar Incidents]
Appendices
Appendix A: Full Command History
# All commands executed during incident response
[timestamp] [command]
[timestamp] [command]
Appendix B: Complete Timeline with All Actions
[More detailed timeline if main timeline was summarized]
Appendix C: Communication Templates Used
[Templates used for status updates]
Sign-off
- Report Author: [Name] - [Date]
- Reviewed By: [Name, Role] - [Date]
- Approved By: [Name, Role] - [Date]
- Next Review Date: [Date] - [For action item progress check]
Document Version: 1.0 Last Updated: [YYYY-MM-DD] Status: Draft / Under Review / Published