Incident Severity Matrix

Select impact and urgency to instantly determine the correct SEV level, response SLA, escalation path and initial communication template. Built for SREs, on-call engineers and incident commanders.

📊 Severity Matrix

Impact ↓ / Urgency →	Immediate	High	Medium	Low
Critical	SEV1	SEV1	SEV2	SEV3
High	SEV1	SEV2	SEV2	SEV3
Medium	SEV2	SEV3	SEV3	SEV4
Low	SEV3	SEV4	SEV4	SEV5

📋 Severity Level Reference

SEV1 Critical: immediate all-hands response, CEO/CTO visibility, comms every 15 min

SEV2 Major: senior on-call + team lead, comms every 30 min, resolve within 4h

SEV3 Significant: on-call engineer leads, comms every hour, resolve within 24h

SEV4 Minor: ticket created, no on-call page, resolve within sprint

SEV5 Cosmetic: backlog ticket, nice to have, no SLA
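
The reference above can also be held as data, so that a bot or runbook script can answer "is an update overdue?" without anyone rereading the table. The following is a minimal sketch, not this tool's implementation: `SeverityPolicy`, `POLICIES` and `comms_due` are hypothetical names, and a target is set to `None` wherever the reference gives no fixed number (SEV1 has no stated resolve time, and "within sprint" is not a fixed number of hours).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity reference (hypothetical structure)."""
    label: str
    responders: str
    comms_interval_minutes: Optional[int]  # None: no cadence stated
    resolve_within_hours: Optional[int]    # None: no fixed hour target stated


# Values copied from the reference list; nothing here is invented beyond the names.
POLICIES = {
    "SEV1": SeverityPolicy("Critical", "all hands, CEO/CTO visibility", 15, None),
    "SEV2": SeverityPolicy("Major", "senior on-call + team lead", 30, 4),
    "SEV3": SeverityPolicy("Significant", "on-call engineer leads", 60, 24),
    "SEV4": SeverityPolicy("Minor", "ticket created, no on-call page", None, None),
    "SEV5": SeverityPolicy("Cosmetic", "backlog ticket", None, None),
}


def comms_due(sev: str, minutes_since_last_update: int) -> bool:
    """Return True when the level has a cadence and it has elapsed."""
    interval = POLICIES[sev].comms_interval_minutes
    return interval is not None and minutes_since_last_update >= interval
```

For example, `comms_due("SEV2", 35)` is true because the SEV2 cadence is 30 minutes, while `comms_due("SEV4", 500)` is false because SEV4 has no communications cadence at all.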

📖 How to Use This Tool

1. Select Business Impact (Critical to Low)

2. Select Urgency (Immediate to Low)

3. Get SEV level, response SLA and escalation path

4. Copy the communication template

📝 Examples

Outage

Input: Critical + Immediate

Output: SEV1 — All hands, 15min comms

A Framework Borrowed, Not Invented

The impact-versus-urgency grid this tool implements didn't originate in software operations at all — it's a direct descendant of IT service management (ITSM) practice, most visibly formalized in ITIL's incident prioritization matrix, which has been classifying support tickets this way since long before "SRE" was a job title. What changed when web-scale engineering organisations adopted it is the vocabulary and the speed: ITIL's matrix was built for a help-desk ticket queue measured in hours or days, while an SRE team applies the identical two-axis logic to decisions that need to be made in the first ninety seconds after a page fires. The underlying idea survived the transplant because it solves the same problem in both settings — replacing an argument about severity with a lookup.

Sixteen Boxes Collapsed Into One Number

Underneath the interface, the tool is a lookup table with sixteen entries: four Business Impact levels (Critical, High, Medium, Low) crossed against four Urgency levels (Immediate, High, Medium, Low) produce exactly one severity outcome per combination, ranging from SEV1 down to SEV5. Selecting impact and urgency doesn't just return a label — each of the sixteen cells carries its own bundle of attributes: a time-to-acknowledge target, a time-to-resolve target, a communications cadence, the on-call scope that should be paged, an escalation chain, a response checklist, and a pre-filled communication template with placeholders ready to drop into Slack, a status page, or PagerDuty. The mechanism is deliberately simple — a matrix lookup rather than a scoring formula — because during a live incident, the last thing a responder needs is an algorithm to argue with.
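
As an illustration of how small such a lookup is, here is a sketch in Python. The sixteen values are copied from the Severity Matrix on this page; the names `IMPACTS`, `URGENCIES`, `SEVERITY_MATRIX` and `classify` are assumptions made for the example and say nothing about how the tool itself is written.

```python
IMPACTS = ("Critical", "High", "Medium", "Low")
URGENCIES = ("Immediate", "High", "Medium", "Low")

# Rows follow IMPACTS, columns follow URGENCIES, as in the published grid.
_GRID = (
    ("SEV1", "SEV1", "SEV2", "SEV3"),
    ("SEV1", "SEV2", "SEV2", "SEV3"),
    ("SEV2", "SEV3", "SEV3", "SEV4"),
    ("SEV3", "SEV4", "SEV4", "SEV5"),
)

# Flatten the grid into a dict keyed by (impact, urgency): sixteen entries.
SEVERITY_MATRIX = {
    (impact, urgency): _GRID[row][col]
    for row, impact in enumerate(IMPACTS)
    for col, urgency in enumerate(URGENCIES)
}


def classify(impact: str, urgency: str) -> str:
    """Look up the severity for one impact/urgency pair."""
    try:
        return SEVERITY_MATRIX[(impact, urgency)]
    except KeyError:
        raise ValueError(
            f"unknown impact/urgency pair: {impact!r}, {urgency!r}"
        ) from None
```

Calling `classify("Critical", "Immediate")` returns `"SEV1"` and `classify("Low", "Low")` returns `"SEV5"`. A plain dictionary is used rather than a scoring function so that every outcome can be read, reviewed and changed one cell at a time.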

A Matrix vs. Just Trusting Your Gut

The obvious alternative to a published matrix is letting whoever is on-call decide severity by feel, and most teams start out doing exactly that. It works fine until the moment it doesn't: two different engineers facing comparable outages will reach for different severities depending on how the last incident review went, how loud the reporting customer is, or how close it is to a release freeze. A matrix removes that variance by forcing the same two questions every time — how bad is the impact, how fast does this need attention — and mapping the answer deterministically, which means two engineers looking at the same facts land on the same severity even if they've never met. The trade-off is upfront cost: someone has to define what "Critical impact" actually means in concrete, measurable terms before the matrix is trustworthy, whereas gut-feel classification requires no setup at all until it quietly produces its first inconsistent, disputed severity call.

The Outage That Took Forty Minutes to Get a Name

A representative scenario: a checkout service starts returning errors for roughly 8% of requests during a regional traffic spike. The on-call engineer isn't sure whether this counts as a full outage, since the service isn't 100% down, so the incident sits undeclared for the first twenty minutes while people discuss severity in a thread instead of paging anyone else in. By the time someone finally calls it a SEV2, escalates, and pulls in a second engineer, forty minutes have passed, the error rate has climbed to 30%, and the delay itself becomes a line item in the post-mortem. Run the same facts through a published impact/urgency matrix at minute one (Medium-to-High impact, Immediate urgency, which the grid above maps to SEV2 or SEV1) and the severity, the escalation path, and the communication template are all decided before the debate ever starts. That is the entire value proposition: the matrix isn't there to make incidents less severe, it's there to stop the classification argument from becoming part of the incident.
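
The "is 8% an outage?" debate disappears only if the impact axis is defined in measurable terms ahead of time. One way to do that is a threshold table on the request error rate, sketched below. The thresholds are purely hypothetical placeholders chosen for illustration; they are not values from this tool, and each team would need to set and agree on its own. `IMPACT_THRESHOLDS` and `impact_from_error_rate` are likewise assumed names.

```python
# Hypothetical thresholds: the minimum fraction of failing requests
# that qualifies for each impact level, checked from most to least severe.
# Placeholder numbers for illustration only; calibrate per organisation.
IMPACT_THRESHOLDS = (
    (0.50, "Critical"),
    (0.20, "High"),
    (0.05, "Medium"),
)


def impact_from_error_rate(error_rate: float) -> str:
    """Map a failing-request fraction (0.0 to 1.0) to an impact level."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction between 0 and 1")
    for floor, level in IMPACT_THRESHOLDS:
        if error_rate >= floor:
            return level
    return "Low"
```

Under these placeholder thresholds, the 8% error rate at the start of the scenario classifies as Medium impact and the later 30% as High, so the on-call engineer has an answer at minute one instead of a thread.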

Frequently Asked Questions

How are incident severity levels determined?

Severity is determined by crossing two independent dimensions on a matrix: Business Impact and Urgency. Business Impact describes how severely the incident affects users or business operations — Critical means a complete outage or data loss affecting all users, while Low means a minor issue affecting very few users with a workaround available. Urgency describes how quickly the issue must be resolved — Immediate means right now, while Low means it can wait for the next sprint. The intersection of these two dimensions produces the severity level. A critical impact with immediate urgency is always SEV1, while a low-impact issue with low urgency is SEV5. This two-dimensional approach prevents under-classifying slow-burn issues with high impact and prevents over-classifying urgent but minor issues.

What is a typical SEV1 response SLA?

A typical SEV1 incident requires acknowledgment within 5 minutes of the alert firing, initial stakeholder communications within 15 minutes, the engineering on-call team assembled in a war room bridge call within 15 minutes, and status page updates every 15 to 30 minutes until resolution. The incident commander role should be assigned within the first 10 minutes to ensure clear ownership of the response. After resolution, a post-mortem (blameless retrospective) should be scheduled within 24 hours while the details are fresh, and the written post-mortem document should be published within 5 business days. These timelines are guidelines — organisations with strict contractual SLAs in their customer agreements may define tighter requirements.
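
The timelines above are simple offsets from two moments, the alert and the resolution, so they can be computed rather than remembered. The sketch below uses only Python's standard `datetime` module; the function and key names are assumptions for the example, and the business-day helper skips weekends only, with no holiday calendar.

```python
from datetime import datetime, timedelta

# Offsets from the alert time, taken from the typical SEV1 figures above.
SEV1_OFFSETS = {
    "acknowledge": timedelta(minutes=5),
    "assign_incident_commander": timedelta(minutes=10),
    "first_stakeholder_comms": timedelta(minutes=15),
    "war_room_assembled": timedelta(minutes=15),
}


def sev1_deadlines(alert_time: datetime) -> dict[str, datetime]:
    """Return the wall-clock deadline for each early SEV1 milestone."""
    return {name: alert_time + offset for name, offset in SEV1_OFFSETS.items()}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Review scheduled within 24 hours, document published within 5 business days."""
    return {
        "schedule_review_by": resolved_at + timedelta(hours=24),
        "publish_document_by": add_business_days(resolved_at, 5),
    }
```

For an incident resolved on a Friday, `postmortem_deadlines` places the publication deadline on the following Friday, because the intervening weekend does not count towards the five business days.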

Should every organisation use the same severity level definitions?

No. Severity definitions must be calibrated to each organisation's scale, customer commitments, and risk tolerance. A startup whose non-critical B2B tool has 5,000 users may reasonably treat most outages as SEV2, while a payments infrastructure company processing millions of transactions per hour might declare SEV1 only when revenue impact exceeds a specific threshold per minute. The exact thresholds matter less than having clear, unambiguous criteria that are documented, agreed upon by leadership and engineering teams, and stable enough that on-call engineers can apply them quickly under stress without needing to escalate the classification decision itself. Review and recalibrate your severity definitions at least annually, or whenever your user base or product criticality changes significantly.