Incident Response Maturity: From Tribal Knowledge to Documented Runbooks
The Hidden Knowledge Problem
When the database fills up and stops accepting writes, who knows how to handle it? One person. And that person is on vacation. When a third-party API goes down and starts rejecting requests, how does the team respond? Someone remembers the steps from last time and Slacks them to the on-call engineer. There's no runbook. There's no escalation path. There's no "if this, then that." There's just tribal knowledge—anecdotes passed around the team like folklore.
This is what immature incident response looks like. It feels fine until the person who knows all the steps leaves. Then production goes down and no one knows what to do.
Incident Response Maturity Levels
Level 1: Ad-Hoc
Someone notices an alert (or a customer complains). Then the team scrambles. Steps are figured out by trial and error. One person does most of the work while others watch Slack. When it's over, there's no post-incident review. The same issue will happen again.
Level 2: Documented Runbooks
Each critical issue has a runbook. "Database Full?" Here's the runbook: check disk usage, identify what is consuming the space, escalate to the DBA if needed. "High Error Rate?" Here's the runbook: check recent deployments, check third-party API status, roll back if needed. New team members can follow the runbooks. Response time is consistent.
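To make the "Database Full?" example concrete, here is a minimal sketch of how the first runbook step could be scripted. The thresholds (80% and 95%), the function names, and the action strings are all hypothetical choices for illustration; a real runbook would use values agreed with whoever owns the database.

```python
import shutil
from dataclasses import dataclass


@dataclass
class StepResult:
    percent_used: float
    action: str


def decide_action(percent_used: float,
                  warn_at: float = 80.0,
                  escalate_at: float = 95.0) -> str:
    """Map disk usage to the next runbook step (thresholds are assumptions)."""
    if percent_used >= escalate_at:
        return "escalate to DBA"
    if percent_used >= warn_at:
        return "identify what is consuming space"
    return "no action needed"


def check_disk_usage(path: str = "/") -> StepResult:
    """Runbook step 1: measure disk usage for the volume containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = usage.used / usage.total * 100
    return StepResult(percent_used, decide_action(percent_used))


if __name__ == "__main__":
    result = check_disk_usage("/")
    print(f"{result.percent_used:.1f}% used -> {result.action}")
```

Keeping the decision in its own function means the thresholds can be reviewed and tested without touching a real disk, which is part of what makes a runbook something a new team member can trust.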
Level 3: Structured Response Process
There's an incident commander. There's a clear escalation path. Severity is defined (SEV-1 means all hands, SEV-3 means async handling). Blameless post-incident reviews are standard. Every incident generates a ticket to prevent recurrence. Knowledge builds over time.
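One way to keep severity definitions and escalation paths unambiguous is to write them down as data rather than prose. The sketch below assumes three levels; only the SEV-1 ("all hands") and SEV-3 ("async handling") behaviours come from the description above, while the SEV-2 entry and every role name are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Policy:
    page_immediately: bool   # wake someone up, or leave it for working hours
    notify: tuple[str, ...]  # hypothetical role names, in escalation order
    description: str


ESCALATION = {
    Severity.SEV1: Policy(True, ("on-call", "incident-commander", "engineering-lead"),
                          "all hands"),
    # SEV-2 is not defined in the text; this entry is an assumed middle tier.
    Severity.SEV2: Policy(True, ("on-call",),
                          "on-call responds, escalates if needed"),
    Severity.SEV3: Policy(False, ("team-queue",),
                          "async handling"),
}


def who_to_notify(severity: Severity) -> tuple[str, ...]:
    """Return the escalation chain for a given severity."""
    return ESCALATION[severity].notify


if __name__ == "__main__":
    for sev in Severity:
        policy = ESCALATION[sev]
        print(sev.name, policy.description, "->", ", ".join(policy.notify))
```

A table like this can live next to the runbooks, so the person declaring an incident does not have to remember who gets paged.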
Level 4: Proactive Prevention
Incidents are rare because systems are designed to fail gracefully. Alerts trigger on early symptoms, before they turn into outages. On-call load is distributed fairly. Chaos testing surfaces weaknesses before they surprise you. Root causes are fixed, not band-aided.
Key Components of Mature IR
A mature incident response program includes:
- Runbooks: Step-by-step guides for common issues. Searchable. Regularly updated.
- Escalation paths: Who to contact and when. Defined by severity.
- Severity definitions: SEV-1 is production down. SEV-3 is degraded performance. Everyone knows what they mean.
- On-call rotation: Fair distribution. Not just one person. Support for new on-call members.
- Post-incident reviews: Blameless analysis of what went wrong and how to prevent it.
- Monitoring and alerting: You catch problems before customers do.
- Communication templates: How to inform users about ongoing incidents.
How Concordance Scores It
Concordance analyzes your incident management system (PagerDuty, Opsgenie, etc.) and your documentation. It measures:
- Do you have documented runbooks for your critical services?
- Do you have defined severity levels and escalation paths?
- What's your incident frequency? (Fewer is better.)
- What's your MTTR (mean time to recovery)?
- Do you track post-incident actions? Are they followed up?
- How distributed is your on-call load?
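Two of the measurements above, MTTR and on-call distribution, are simple to compute from exported incident records. The sketch below is an illustration only, not a description of how Concordance calculates its scores; the record fields (`started_at`, `resolved_at`, `responder`) and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[dict]) -> timedelta:
    """Average of (resolved_at - started_at) across all incidents."""
    durations = [i["resolved_at"] - i["started_at"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)


def on_call_share(incidents: list[dict]) -> dict[str, float]:
    """Fraction of incidents handled by each responder."""
    counts = Counter(i["responder"] for i in incidents)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Hypothetical sample data: three incidents lasting 30, 90, and 60 minutes.
incidents = [
    {"started_at": datetime(2024, 1, 1, 10, 0),
     "resolved_at": datetime(2024, 1, 1, 10, 30), "responder": "alice"},
    {"started_at": datetime(2024, 1, 2, 9, 0),
     "resolved_at": datetime(2024, 1, 2, 10, 30), "responder": "alice"},
    {"started_at": datetime(2024, 1, 3, 14, 0),
     "resolved_at": datetime(2024, 1, 3, 15, 0), "responder": "bob"},
]

if __name__ == "__main__":
    print("MTTR:", mean_time_to_recovery(incidents))  # 1:00:00
    print("Load:", on_call_share(incidents))
```

In this sample one responder handles two of the three incidents, which is the kind of skew a fair on-call rotation is meant to expose.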
CRA Relevance: Reporting Obligations
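The EU Cyber Resilience Act (CRA) imposes reporting obligations on manufacturers. An actively exploited vulnerability or a severe incident must be reported as an early warning within 24 hours of becoming aware of it. A fuller notification is due within 72 hours. For an actively exploited vulnerability, the final report is due within 14 days of a corrective or mitigating measure becoming available; for a severe incident, it is due within one month of the 72-hour notification. You can only meet these deadlines if your incident response is fast and documented.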
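Because these windows are short, it can help to compute the concrete due times the moment an incident is opened. The sketch below is a rough illustration under stated assumptions: the function and key names are hypothetical, the 24-hour and 72-hour windows are counted from the moment the team became aware of the issue, and the 14-day window is assumed to start when a fix becomes available. That last point is this sketch's reading of the rule and should be checked against the regulation text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def reporting_deadlines(aware_at: datetime,
                        fix_available_at: Optional[datetime] = None) -> dict:
    """Compute reporting due times from the moment of awareness.

    The final-report deadline is only included once a fix date is known
    (assumption: the 14-day clock starts when the fix is available).
    """
    deadlines = {
        "early_warning": aware_at + timedelta(hours=24),
        "notification": aware_at + timedelta(hours=72),
    }
    if fix_available_at is not None:
        deadlines["final_report"] = fix_available_at + timedelta(days=14)
    return deadlines


if __name__ == "__main__":
    aware = datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)
    fix = datetime(2025, 1, 10, 8, 0, tzinfo=timezone.utc)
    for name, due in reporting_deadlines(aware, fix).items():
        print(f"{name}: {due.isoformat()}")
```

Using timezone-aware timestamps avoids ambiguity when responders and regulators sit in different time zones.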
Beyond the reporting, CRA auditors want to see that you have a process. That incidents are tracked. That actions are taken to prevent recurrence. Mature incident response is the evidence you need to demonstrate this.