
Incident Severity Matrix

Select impact and urgency to instantly determine the correct SEV level, response SLA, escalation path and initial communication template. Built for SREs, on-call engineers and incident commanders.

📊 Severity Matrix
Impact ↓ / Urgency → | Immediate | High | Medium | Low
Critical | SEV1 | SEV1 | SEV2 | SEV3
High | SEV1 | SEV2 | SEV2 | SEV3
Medium | SEV2 | SEV3 | SEV3 | SEV4
Low | SEV3 | SEV4 | SEV4 | SEV5
📋 Severity Level Reference
SEV1 (Critical): immediate all-hands response, CEO/CTO visibility, comms every 15 min
SEV2 (Major): senior on-call + team lead, comms every 30 min, resolve within 4h
SEV3 (Significant): on-call engineer leads, comms every hour, resolve within 24h
SEV4 (Minor): ticket created, no on-call page, resolve within sprint
SEV5 (Cosmetic): backlog ticket, nice to have, no SLA
📖 How to Use This Tool
1. Select Business Impact (Critical to Low)
2. Select Urgency (Immediate to Low)
3. Get SEV level, response SLA and escalation path
4. Copy the communication template
📝 Examples
Outage
Input: Critical + Immediate
Output: SEV1 (all hands, comms every 15 min)

What is the Incident Severity Matrix?

The Incident Severity Matrix is a structured decision tool that helps SREs, on-call engineers, and incident commanders classify the severity of a production incident by evaluating two dimensions: business impact and urgency. By selecting the appropriate impact level (Critical, High, Medium, Low) and urgency level (Immediate, High, Medium, Low), the tool maps your selection to a severity level from SEV1 (most critical) to SEV5 (cosmetic), and immediately provides the corresponding response SLA, escalation path, incident response checklist, and a pre-written communication template ready to post to Slack or a status page.

Consistent severity classification is one of the most important, and most commonly neglected, pillars of effective incident management. When engineers classify incidents inconsistently under pressure, teams page the wrong people, set incorrect stakeholder expectations, and generate misleading incident metrics that make it hard to tell whether reliability is improving or worsening over time. A published severity matrix that all team members can reference reduces ambiguity and lets on-call engineers focus their energy on resolution rather than on debating classification.

When to Use This Tool

How It Works

The tool implements a standard 4×4 impact-urgency matrix, a well-established framework in IT service management (ITSM) methodologies such as ITIL. It maps all sixteen combinations of four impact levels and four urgency levels to a severity level between SEV1 and SEV5. Each severity level has a defined set of attributes: time-to-acknowledge, time-to-resolve, communications cadence, on-call scope, escalation chain, a response checklist, and an initial communication template with placeholders for incident-specific details. The templates follow a structure common in SRE incident communications and can be copied directly into Slack, PagerDuty, or your incident management platform.
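As a rough illustration of how per-level attributes and a placeholder template might fit together, here is a minimal Python sketch. The `SeverityProfile` class, the `PROFILES` table, and the template wording are all hypothetical names chosen for this example; only the cadence and resolve values come from the Severity Level Reference on this page, and the tool's real templates are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityProfile:
    label: str
    responders: str
    comms_interval_min: Optional[int]  # None: no scheduled updates
    resolve_target: Optional[str]      # None: not stated in the reference list


# Values transcribed from the Severity Level Reference on this page.
PROFILES = {
    "SEV1": SeverityProfile("Critical", "all hands, CEO/CTO visibility", 15, None),
    "SEV2": SeverityProfile("Major", "senior on-call + team lead", 30, "4h"),
    "SEV3": SeverityProfile("Significant", "on-call engineer", 60, "24h"),
    "SEV4": SeverityProfile("Minor", "ticket only, no page", None, "current sprint"),
    "SEV5": SeverityProfile("Cosmetic", "backlog ticket", None, None),
}

# Hypothetical template wording (an assumption, not the tool's actual text).
TEMPLATE = (
    "[{sev} - {label}] {title}\n"
    "Impact: {impact}\n"
    "Responders: {responders}\n"
    "Next update: {next_update}"
)


def initial_message(sev: str, title: str, impact: str) -> str:
    """Fill the placeholder template for the given severity level."""
    profile = PROFILES[sev]
    if profile.comms_interval_min is None:
        next_update = "via ticket"
    else:
        next_update = f"in {profile.comms_interval_min} min"
    return TEMPLATE.format(
        sev=sev, label=profile.label, title=title, impact=impact,
        responders=profile.responders, next_update=next_update,
    )


print(initial_message("SEV2", "Checkout latency elevated", "about 20% of requests"))
```

Keeping the attributes in one table means the communication cadence in the message can never drift from the published definition of the level.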

Frequently Asked Questions

How are incident severity levels determined?

Severity is determined by crossing two independent dimensions on a matrix: Business Impact and Urgency. Business Impact describes how severely the incident affects users or business operations: Critical means a complete outage or data loss affecting all users, while Low means a minor issue affecting very few users with a workaround available. Urgency describes how quickly the issue must be resolved: Immediate means right now, while Low means it can wait for the next sprint. The intersection of these two dimensions produces the severity level. A critical impact with immediate urgency is always SEV1, while a low-impact issue with low urgency is SEV5. This two-dimensional approach prevents under-classifying slow-burn issues with high impact (Critical impact with Low urgency is still SEV3, not SEV5) and prevents over-classifying urgent but minor issues (Low impact with Immediate urgency is also SEV3, not SEV1).
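The lookup described above can be sketched as a small table in Python. The cell values are copied from the severity matrix on this page; the names `MATRIX`, `URGENCIES`, and `classify` are assumptions made for this example, not part of the tool.

```python
# Column order for each row of the matrix.
URGENCIES = ("Immediate", "High", "Medium", "Low")

# Rows are impact levels; cells follow the URGENCIES column order.
MATRIX = {
    "Critical": ("SEV1", "SEV1", "SEV2", "SEV3"),
    "High":     ("SEV1", "SEV2", "SEV2", "SEV3"),
    "Medium":   ("SEV2", "SEV3", "SEV3", "SEV4"),
    "Low":      ("SEV3", "SEV4", "SEV4", "SEV5"),
}


def classify(impact: str, urgency: str) -> str:
    """Return the severity level for an impact/urgency pair."""
    if impact not in MATRIX:
        raise ValueError(f"unknown impact: {impact!r}")
    if urgency not in URGENCIES:
        raise ValueError(f"unknown urgency: {urgency!r}")
    return MATRIX[impact][URGENCIES.index(urgency)]


print(classify("Critical", "Immediate"))  # SEV1
print(classify("Critical", "Low"))        # SEV3: slow-burn, high impact
print(classify("Low", "Immediate"))       # SEV3: urgent, minor impact
```

Rejecting unknown labels with an error, rather than falling back to a default level, is a deliberate choice in this sketch: a silent default could hide a misclassified incident.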

What is a typical SEV1 response SLA?

A typical SEV1 incident requires acknowledgment within 5 minutes of the alert firing, initial stakeholder communications within 15 minutes, the engineering on-call team assembled in a war room bridge call within 15 minutes, and status page updates every 15 to 30 minutes until resolution. The incident commander role should be assigned within the first 10 minutes to ensure clear ownership of the response. After resolution, a post-mortem (blameless retrospective) should be scheduled within 24 hours while the details are fresh, and the written post-mortem document should be published within 5 business days. These timelines are guidelines — organisations with strict contractual SLAs in their customer agreements may define tighter requirements.
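To make these timelines concrete, the following Python sketch turns them into absolute deadlines. The function and key names are hypothetical, and two assumptions are baked in: the 5 business days are counted from resolution, and public holidays are ignored.

```python
from datetime import datetime, timedelta, timezone

# Offsets from the moment the alert fires, taken from the guideline above.
SEV1_TARGETS = {
    "acknowledge": timedelta(minutes=5),
    "assign_incident_commander": timedelta(minutes=10),
    "first_stakeholder_comms": timedelta(minutes=15),
    "bridge_assembled": timedelta(minutes=15),
}


def sev1_deadlines(alert_fired_at: datetime) -> dict:
    """Absolute deadline for each early SEV1 response milestone."""
    return {name: alert_fired_at + delta for name, delta in SEV1_TARGETS.items()}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole weekdays (Mon-Fri); holidays are not considered."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def postmortem_deadlines(resolved_at: datetime) -> dict:
    """Post-resolution deadlines, assuming both clocks start at resolution."""
    return {
        "schedule_postmortem": resolved_at + timedelta(hours=24),
        "publish_postmortem": add_business_days(resolved_at, 5),
    }


fired = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)  # a Friday
for name, due in sev1_deadlines(fired).items():
    print(f"{name}: {due.isoformat()}")
```

Organisations with contractual SLAs would replace the offsets in `SEV1_TARGETS` with their own values.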

Should every organisation use the same severity level definitions?

No — severity definitions must be calibrated to each organisation's scale, customer commitments, and risk tolerance. A startup with 5,000 users serving a non-critical B2B tool may reasonably treat most outages as SEV2, while a payments infrastructure company with millions of transactions per hour might define SEV1 only when revenue impact exceeds a specific threshold per minute. The exact thresholds matter less than having clear, unambiguous criteria that are documented, agreed upon by leadership and engineering teams, and stable enough that on-call engineers can apply them quickly under stress without needing to escalate the classification decision itself. Review and calibrate your severity definitions at least annually, or whenever your user base or product criticality changes significantly.
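When a team recalibrates its matrix, a quick consistency check can catch accidental mistakes. The Python sketch below is one possible check, not something the tool is known to perform: it verifies that every row has four cells and that severity never becomes more serious as impact or urgency decreases. The name `validate_matrix` and the monotonicity rule itself are assumptions made for this example.

```python
IMPACTS = ("Critical", "High", "Medium", "Low")
URGENCIES = ("Immediate", "High", "Medium", "Low")


def rank(sev: str) -> int:
    """'SEV1' -> 1; a lower number means a more severe incident."""
    return int(sev[3:])


def validate_matrix(matrix: dict) -> list:
    """Return a list of problems; an empty list means the matrix is consistent."""
    problems = []
    for impact in IMPACTS:
        row = matrix.get(impact)
        if row is None or len(row) != len(URGENCIES):
            problems.append(f"row {impact!r} must have {len(URGENCIES)} cells")
    if problems:
        return problems

    for i, impact in enumerate(IMPACTS):
        for j, urgency in enumerate(URGENCIES):
            here = rank(matrix[impact][j])
            # Moving right (less urgent) must not raise severity.
            if j + 1 < len(URGENCIES) and rank(matrix[impact][j + 1]) < here:
                problems.append(f"{impact}/{URGENCIES[j + 1]} outranks {impact}/{urgency}")
            # Moving down (less impact) must not raise severity.
            if i + 1 < len(IMPACTS) and rank(matrix[IMPACTS[i + 1]][j]) < here:
                problems.append(f"{IMPACTS[i + 1]}/{urgency} outranks {impact}/{urgency}")
    return problems


# Hypothetical calibration for a small team that reserves SEV1 for nothing.
startup_matrix = {
    "Critical": ("SEV2", "SEV2", "SEV3", "SEV3"),
    "High":     ("SEV2", "SEV3", "SEV3", "SEV4"),
    "Medium":   ("SEV3", "SEV3", "SEV4", "SEV4"),
    "Low":      ("SEV4", "SEV4", "SEV5", "SEV5"),
}
print(validate_matrix(startup_matrix))
```

Running a check like this as part of the annual review keeps a customised matrix internally consistent even when individual cells are renegotiated.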