Incident Severity Matrix
Select impact and urgency to instantly determine the correct SEV level, response SLA, escalation path and initial communication template. Built for SREs, on-call engineers and incident commanders.
| Impact ↓ / Urgency → | Immediate | High | Medium | Low |
|---|---|---|---|---|
| Critical | SEV1 | SEV1 | SEV2 | SEV3 |
| High | SEV1 | SEV2 | SEV2 | SEV3 |
| Medium | SEV2 | SEV3 | SEV3 | SEV4 |
| Low | SEV3 | SEV4 | SEV4 | SEV5 |
What is the Incident Severity Matrix?
The Incident Severity Matrix is a structured decision tool that helps SREs, on-call engineers, and incident commanders classify the severity of a production incident by evaluating two dimensions: business impact and urgency. By selecting the appropriate impact level (Critical, High, Medium, Low) and urgency level (Immediate, High, Medium, Low), the tool maps your selection to a severity level from SEV1 (most critical) to SEV5 (cosmetic), and immediately provides the corresponding response SLA, escalation path, incident response checklist, and a pre-written communication template ready to post to Slack or a status page.
Consistent severity classification is one of the most important, and most commonly neglected, pillars of effective incident management. When engineers classify incidents inconsistently under pressure, teams page the wrong people, set incorrect stakeholder expectations, and generate misleading incident metrics that make it hard to tell whether reliability is improving or worsening. A published severity matrix that all team members can reference reduces ambiguity and lets on-call engineers focus their energy on resolution rather than debating classification.
When to Use This Tool
- On-call runbook creation: Use the matrix and response templates as the starting point for your team's incident runbook, customising the escalation paths and SLA times to match your organisation's commitments.
- Live incident classification: During an active incident, quickly select impact and urgency to determine the correct SEV level and immediately receive the response checklist and communication template.
- Incident management process design: When setting up or reviewing your on-call rotation and incident process, use this matrix to ensure your severity definitions are internally consistent and cover all combinations of impact and urgency.
- New team member training: Walk through example scenarios with junior engineers to build intuition for when to escalate and what each severity level demands from responders.
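For the process-design use case, one way to check that a matrix is internally consistent is sketched below. It assumes "consistent" means that every impact/urgency combination is covered and that lowering impact or urgency never produces a more severe level. The `check_matrix` helper and the numeric encoding of SEV levels are illustrative assumptions, not part of the tool.

```python
from itertools import product

IMPACTS = ["Critical", "High", "Medium", "Low"]      # most to least severe
URGENCIES = ["Immediate", "High", "Medium", "Low"]   # most to least urgent

# Rows follow IMPACTS, columns follow URGENCIES; values are SEV numbers
# copied from the matrix at the top of this page.
MATRIX = [
    [1, 1, 2, 3],
    [1, 2, 2, 3],
    [2, 3, 3, 4],
    [3, 4, 4, 5],
]


def check_matrix(matrix):
    """Return a list of problems; an empty list means the matrix passed.

    Hypothetical helper: checks coverage, the SEV1..SEV5 range, and that
    severity never increases as impact or urgency decreases.
    """
    if len(matrix) != len(IMPACTS) or any(len(row) != len(URGENCIES) for row in matrix):
        return ["matrix does not cover every impact/urgency combination"]

    problems = []
    for i, u in product(range(len(IMPACTS)), range(len(URGENCIES))):
        sev = matrix[i][u]
        cell = f"{IMPACTS[i]}/{URGENCIES[u]}"
        if not 1 <= sev <= 5:
            problems.append(f"{cell}: SEV{sev} is outside SEV1..SEV5")
        # Moving one column right lowers urgency; the SEV number must not drop.
        if u + 1 < len(URGENCIES) and matrix[i][u + 1] < sev:
            problems.append(f"{cell}: lower urgency gives a more severe level")
        # Moving one row down lowers impact; the SEV number must not drop.
        if i + 1 < len(IMPACTS) and matrix[i + 1][u] < sev:
            problems.append(f"{cell}: lower impact gives a more severe level")
    return problems


if __name__ == "__main__":
    issues = check_matrix(MATRIX)
    print("matrix is consistent" if not issues else "\n".join(issues))
```

Running the same check against a customised matrix before publishing it is a cheap way to catch a cell that was edited in isolation.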
How It Works
The tool implements a standard 4x4 impact-urgency matrix, a well-established framework in IT service management (ITSM) methodologies including ITIL, and maps all sixteen combinations of four impact levels and four urgency levels to a severity level from SEV1 to SEV5. Each severity level has a defined set of attributes: time-to-acknowledge, time-to-resolve, communications cadence, on-call scope, escalation chain, a response checklist, and an initial communication template with placeholders for incident-specific details. The templates follow a structure common in SRE practice and can be copied directly into Slack, PagerDuty, or your incident management platform.
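The lookup itself can be sketched in a few lines. The table values below are taken from the matrix at the top of this page; the `classify` and `initial_message` functions, and the wording and placeholder names in `TEMPLATE`, are assumptions made for illustration and are not the tool's actual templates.

```python
URGENCIES = ("Immediate", "High", "Medium", "Low")

# One row per impact level; columns follow URGENCIES.
ROWS = {
    "Critical": ("SEV1", "SEV1", "SEV2", "SEV3"),
    "High":     ("SEV1", "SEV2", "SEV2", "SEV3"),
    "Medium":   ("SEV2", "SEV3", "SEV3", "SEV4"),
    "Low":      ("SEV3", "SEV4", "SEV4", "SEV5"),
}

# Hypothetical template text; replace with your organisation's wording.
TEMPLATE = (
    "[{sev}] {summary}. Impact: {impact}. Urgency: {urgency}. "
    "Incident commander: {commander}. Next update in {next_update_minutes} minutes."
)


def classify(impact, urgency):
    """Map an impact/urgency pair to a SEV level using the matrix."""
    if impact not in ROWS:
        raise ValueError(f"unknown impact: {impact!r}")
    if urgency not in URGENCIES:
        raise ValueError(f"unknown urgency: {urgency!r}")
    return ROWS[impact][URGENCIES.index(urgency)]


def initial_message(impact, urgency, summary, commander, next_update_minutes):
    """Fill the template placeholders with incident-specific details."""
    return TEMPLATE.format(
        sev=classify(impact, urgency),
        summary=summary,
        impact=impact,
        urgency=urgency,
        commander=commander,
        next_update_minutes=next_update_minutes,
    )


if __name__ == "__main__":
    print(initial_message("Critical", "Immediate", "Checkout API is down", "@alice", 30))
```

Keeping the matrix as plain data, separate from the lookup function, makes it straightforward to swap in an organisation's own cell values without touching the logic.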
Frequently Asked Questions
How are incident severity levels determined?
Severity is determined by crossing two independent dimensions on a matrix: Business Impact and Urgency. Business Impact describes how severely the incident affects users or business operations — Critical means a complete outage or data loss affecting all users, while Low means a minor issue affecting very few users with a workaround available. Urgency describes how quickly the issue must be resolved — Immediate means right now, while Low means it can wait for the next sprint. The intersection of these two dimensions produces the severity level. A critical impact with immediate urgency is always SEV1, while a low-impact issue with low urgency is SEV5. This two-dimensional approach prevents under-classifying slow-burn issues with high impact and prevents over-classifying urgent but minor issues.
What is a typical SEV1 response SLA?
A typical SEV1 incident requires acknowledgment within 5 minutes of the alert firing, initial stakeholder communications within 15 minutes, the engineering on-call team assembled in a war room bridge call within 15 minutes, and status page updates every 15 to 30 minutes until resolution. The incident commander role should be assigned within the first 10 minutes to ensure clear ownership of the response. After resolution, a post-mortem (blameless retrospective) should be scheduled within 24 hours while the details are fresh, and the written post-mortem document should be published within 5 business days. These timelines are guidelines — organisations with strict contractual SLAs in their customer agreements may define tighter requirements.
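As a rough sketch, the early-response targets above can be turned into concrete deadlines measured from the moment the alert fires. The durations come from the guideline figures in this answer; the step names and the `sev1_deadlines` and `overdue_steps` helpers are hypothetical, and the post-resolution targets (post-mortem scheduling and publication) are left out because they are measured from resolution rather than from the alert.

```python
from datetime import datetime, timedelta, timezone

# Guideline SEV1 targets, measured from the alert firing.
# Step names are illustrative; adjust durations to your own commitments.
SEV1_TARGETS = {
    "acknowledge": timedelta(minutes=5),
    "assign_incident_commander": timedelta(minutes=10),
    "initial_stakeholder_comms": timedelta(minutes=15),
    "bridge_call_assembled": timedelta(minutes=15),
}


def sev1_deadlines(alert_fired_at):
    """Return the absolute deadline for each early-response step."""
    return {step: alert_fired_at + target for step, target in SEV1_TARGETS.items()}


def overdue_steps(alert_fired_at, completed, now):
    """Return steps whose deadline has passed and that are not yet completed."""
    deadlines = sev1_deadlines(alert_fired_at)
    return sorted(
        step for step, due in deadlines.items()
        if step not in completed and now > due
    )


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for step, due in sev1_deadlines(fired).items():
        print(f"{step}: due by {due.isoformat()}")
    print(overdue_steps(fired, {"acknowledge"}, fired + timedelta(minutes=12)))
```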
Should every organisation use the same severity level definitions?
No. Severity definitions must be calibrated to each organisation's scale, customer commitments, and risk tolerance. A startup serving 5,000 users with a non-critical B2B tool may reasonably treat most outages as SEV2, while a payments infrastructure company processing millions of transactions per hour might declare SEV1 only when revenue impact exceeds a specific threshold per minute. The exact thresholds matter less than having clear, unambiguous criteria that are documented, agreed upon by leadership and engineering teams, and stable enough that on-call engineers can apply them quickly under stress without needing to escalate the classification decision itself. Review and calibrate your severity definitions at least annually, or whenever your user base or product criticality changes significantly.