
Incident Response Template: Copy-Paste SaaS Playbook


When the database goes sideways at 2 a.m., nobody is reading a 79-page NIST PDF. Your engineers need an incident response template they can scan in 30 seconds, paste into Slack, and act on. The post-mortem comes later. Right now, the site is down, and someone has to declare the severity, page the right people, and start writing customer comms before the support inbox catches fire.

This post is the active-response playbook, not the post-mortem template you'll fill out tomorrow. Below is a copy-paste incident response template covering severity definitions, the incident commander role, internal Slack format, customer-facing status updates, the decision log, escalation rules, and a handoff section that hands cleanly to your retro process. We'll walk through one real SEV1 (database read replica failure) so you can see how the template actually plays out, then compare it with a generic runbook template so you know which one you actually need.

What an incident response template is for

An incident response template is the document your team opens the moment something is broken. It tells whoever is on call: how bad is this, who's running it, who's writing customer comms, what we tried, what we decided, and what we hand off when the next person rotates in.

It is not a policy document. It is not a compliance binder. It is the live working file for one incident, opened from a template, filled out as the incident unfolds, and saved to your docs site when the dust settles.

The template fixes three problems that kill response speed:

  1. Decision paralysis. Without pre-written severity tiers, engineers waste 10 minutes arguing whether a degraded checkout flow is SEV1 or SEV2. With them, it takes 30 seconds.
  2. Comms drift. Without templated customer wording, the marketing person posts something to the status page that contradicts what an engineer just told a customer over email.
  3. Lost context on handoff. Without a structured decision log, the engineer who comes online at hour three has to re-read 400 Slack messages to figure out what's already been ruled out.

Most of the 271% quarter-over-quarter growth in searches for this term comes from SaaS engineering teams realizing that the cybersecurity-flavored NIST templates don't fit their workflow. They need something tighter.

Incident response template vs post-mortem template vs runbook

Three documents, three jobs. Teams confuse them constantly.

| Document | When you open it | What it captures | Who owns it |
|----------|------------------|------------------|-------------|
| Incident response template | DURING the incident, minute one | Severity, comms, decisions, timeline as it happens | Incident commander |
| Post-mortem template | AFTER the incident, 24-72h later | Root cause, contributing factors, action items | Engineering lead + IC |
| Runbook template | When a known scenario triggers | Step-by-step recovery for one specific failure mode | Service owner |

The incident response template is the active-response wrapper. Inside it, an engineer might execute a runbook ("read replica failover") because that runbook exists for that exact failure. After resolution, the IC opens a post-mortem template to analyze what happened. All three live in the same docs system but get used at different moments.

If your team has only one of these, build the incident response template first. It's the one that actually saves you during the incident.

The incident response template (copy-paste)

Drop this directly into your docs system. Fill in the bracketed fields when an incident opens.

# Incident: [Short descriptive title]

**Status:** [Open | Mitigated | Resolved]
**Severity:** [SEV1 | SEV2 | SEV3]
**Started:** [UTC timestamp when impact began]
**Detected:** [UTC timestamp when first alert/report]
**Resolved:** [UTC timestamp when fully restored, leave blank until done]
**Incident commander:** [Name + Slack handle]
**Comms lead:** [Name + Slack handle]
**Scribe:** [Name + Slack handle]
**Slack channel:** #inc-[YYYY-MM-DD]-[short-slug]
**Status page incident:** [URL]

## Summary
[One paragraph, updated as you learn more. Plain language. What is broken, who is affected, what is the current mitigation status.]

## Impact
- **Customers affected:** [percentage / count / segments]
- **Services affected:** [list of services]
- **Customer-visible symptoms:** [what the user sees]
- **Revenue impact (estimate):** [if known]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:02 | First PagerDuty alert: read replica lag >30s |
| 14:04 | IC declared, comms lead assigned |
| 14:07 | Status page posted: "investigating" |
| 14:14 | Confirmed root cause: replica disk full |
| 14:18 | Failover initiated |
| 14:22 | Failover complete, latency normal |
| 14:26 | Status page: "resolved", monitoring |

## Decision Log
| Time | Decision | Rationale | Owner |
|------|----------|-----------|-------|
| 14:09 | Hold off on full failover until we confirm replica vs primary | Premature failover risks data loss | IC |
| 14:18 | Trigger failover to standby replica | Disk-full confirmed, no other path | IC |

## Actions Taken
- [What we tried, what worked, what didn't]

## Communications
- Internal Slack: see #inc-... channel
- Customer status page: [link to incident]
- Customer email: [sent / not sent / draft link]
- Account exec outreach to top-10 affected customers: [owner, status]

## Handoff Notes
[If IC is rotating: current state, what to watch for, who is paged in, next decision point]

## Links
- Runbook used: [URL]
- Dashboards: [URLs]
- Related incidents: [URLs]
- Post-mortem: [URL, fill in after retro]

That's the entire active document. Everything else (the severity definitions, the comms templates, the escalation tree) lives separately as reference docs that the incident commander pulls from. Below are those pieces.

Severity definitions (SEV1 to SEV3)

Pre-define these. If your team is debating severity in the moment, you've already lost five minutes. PagerDuty's State of Digital Operations report found teams with documented severity criteria mitigate incidents around 30% faster than teams without them.

SEV1: Critical

  • Customer-facing service is down or severely degraded for a meaningful share of users
  • Data loss, security breach, or financial transaction failure
  • Response SLA: page within 5 minutes, IC assigned within 10 minutes, public status page update within 15 minutes
  • All-hands posture: IC + comms lead + scribe + relevant service owners online, no other engineering work happens until mitigated

SEV2: High

  • Partial service degradation, important feature broken, but workarounds exist
  • Affects a single customer segment or non-critical surface
  • Response SLA: page within 15 minutes, IC assigned within 30 minutes, status page update within 30 minutes if customer-visible
  • Focused team, but the company keeps working

SEV3: Moderate

  • Minor bug, internal-only impact, or external impact with full workaround
  • No status page entry required unless customer reports
  • Response SLA: ticket filed, addressed in normal sprint cadence
  • One engineer can handle it

Don't add SEV4 and SEV5. They become a place to bury things. If it's not at least a SEV3, it's a regular bug ticket.
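
If you wire these tiers into tooling (a paging script, a Slack bot that nags about overdue updates), it helps to encode them once and read them everywhere. Below is a minimal Python sketch; the field names and the idea of a shared config are assumptions, while the numbers are simply the SLAs above.

# Hypothetical encoding of the severity tiers, so on-call tooling can read
# the same definitions the humans do. Times are minutes; None means "not required".
SEVERITY_TIERS = {
    "SEV1": {
        "description": "Customer-facing service down or severely degraded",
        "page_within_min": 5,
        "ic_assigned_within_min": 10,
        "status_page_within_min": 15,
        "channel_update_every_min": 15,
    },
    "SEV2": {
        "description": "Partial degradation, workarounds exist",
        "page_within_min": 15,
        "ic_assigned_within_min": 30,
        "status_page_within_min": 30,   # only if customer-visible
        "channel_update_every_min": 30,
    },
    "SEV3": {
        "description": "Minor bug or internal-only impact",
        "page_within_min": None,        # ticket, not a page
        "ic_assigned_within_min": None,
        "status_page_within_min": None,
        "channel_update_every_min": None,
    },
}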

Roles during an incident

Three roles. Hard requirement on SEV1, recommended on SEV2.

Incident Commander (IC). One person. Owns the response, makes calls, assigns work. The IC does NOT debug. Their job is coordination. If your most senior engineer is also the only person who can fix the bug, they should NOT be IC. Get someone else to coordinate while the senior engineer focuses on the fix.

Comms Lead. Writes the status page updates. Drafts the customer email. Coordinates with support, success, and marketing on external messaging. Owns the rule that no two channels say different things.

Scribe. Updates the timeline and decision log in real time. The IC will be on calls and in DMs. Without a scribe, the timeline gets reconstructed from memory at 4 a.m. and is wrong.

For a SEV1, those three roles are filled by three different people. Do not let the IC also scribe.

Internal Slack channel format

Open #inc-YYYY-MM-DD-short-slug (e.g., #inc-2026-05-04-replica-lag) at incident declaration. Pin this format as the topic:

Status: INVESTIGATING | MITIGATED | RESOLVED
IC: @name  Comms: @name  Scribe: @name
Doc: [link to incident response template]
Status page: [link]
Next update: [UTC time]

Update the topic every time status changes. The first message in the channel is the IC posting:

:rotating_light: SEV1 declared.

Impact: Checkout API returning 500 for ~40% of requests since 14:02 UTC.
IC: @aidan
Comms: @sam
Scribe: @priya
Doc: [link]

Next update in this channel: 14:20 UTC.

Updates land in the channel every 15 minutes for SEV1, every 30 for SEV2, even if nothing has changed. "Still investigating, no new findings, next update at 14:35" is a valid update. Silence creates panic.
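
If you'd rather generate the channel name and the declaration message than type them under pressure, a small helper does it. This is a sketch, assuming Python and made-up names and URLs; actually posting to Slack is left to whatever client your team already uses.

from datetime import datetime, timezone

def incident_channel_name(slug: str, started: datetime) -> str:
    # Builds the #inc-YYYY-MM-DD-short-slug name described above.
    return f"#inc-{started:%Y-%m-%d}-{slug}"

def declaration_message(severity: str, impact: str, ic: str, comms: str,
                        scribe: str, doc_url: str, next_update: str) -> str:
    # Renders the IC's first message in the incident channel.
    return (
        f":rotating_light: {severity} declared.\n\n"
        f"Impact: {impact}\n"
        f"IC: {ic}\nComms: {comms}\nScribe: {scribe}\n"
        f"Doc: {doc_url}\n\n"
        f"Next update in this channel: {next_update}."
    )

if __name__ == "__main__":
    started = datetime(2026, 5, 4, 14, 2, tzinfo=timezone.utc)
    print(incident_channel_name("replica-lag", started))
    print(declaration_message(
        "SEV1",
        "Checkout API returning 500 for ~40% of requests since 14:02 UTC",
        "@aidan", "@sam", "@priya",
        "https://docs.example.com/inc/2026-05-04-replica-lag",  # placeholder URL
        "14:20 UTC",
    ))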

Customer-facing status page wording

Three states, three templates. Edit only the bracketed fields.

Investigating (initial post within 15 minutes of SEV1):

We're investigating an issue affecting [feature/area]. Some users may experience [specific symptom]. We're actively working on a fix and will post an update within 30 minutes.

Identified (when you know what's wrong but it isn't fixed yet):

We've identified the cause of the [feature/area] issue and are deploying a fix. Affected users may continue to see [symptom] until the fix is live. ETA [time].

Resolved:

The issue affecting [feature/area] is resolved as of [time UTC]. All systems are operating normally. We're conducting a full review and will publish a post-mortem within [3-5 business days]. We're sorry for the disruption.

Three rules for customer comms:

  1. Never speculate about cause until you've confirmed it. "We're investigating reports of slow page loads" is fine. "We believe a third-party API is degraded" is not, until you've verified.
  2. No internal jargon. Customers don't care that the read replica's WAL fell behind. They care that checkout is broken.
  3. Always commit to a follow-up. A post-mortem timeline. A retro link. A direct email to top-affected accounts. Silence after a SEV1 is the thing that loses customers.
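
The comms lead can fill the bracketed fields by hand, or you can treat the three templates as string templates so the wording never drifts between incidents. A minimal sketch; the function and field names here are invented for illustration, the wording is the templates above.

# Hypothetical helper: the comms lead edits variables, not prose.
STATUS_TEMPLATES = {
    "investigating": (
        "We're investigating an issue affecting {area}. Some users may "
        "experience {symptom}. We're actively working on a fix and will "
        "post an update within 30 minutes."
    ),
    "identified": (
        "We've identified the cause of the {area} issue and are deploying "
        "a fix. Affected users may continue to see {symptom} until the fix "
        "is live. ETA {eta}."
    ),
    "resolved": (
        "The issue affecting {area} is resolved as of {resolved_at} UTC. "
        "All systems are operating normally. We're conducting a full review "
        "and will publish a post-mortem within {followup}. We're sorry for "
        "the disruption."
    ),
}

def render_status(state: str, **fields: str) -> str:
    # Fills one template; raises KeyError if a required field is missing.
    return STATUS_TEMPLATES[state].format(**fields)

print(render_status("investigating", area="checkout", symptom="failed payments"))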

Escalation path

Pre-define who to page when the on-call engineer can't resolve a SEV1 within 30 minutes.

| Trigger | Escalate to |
|---------|-------------|
| Tier 1 (0-30 min) | On-call engineer for the affected service |
| Tier 2 (30-60 min) | Service tech lead + on-call IC |
| Tier 3 (60+ min) | Engineering manager + VP Eng |
| SEV1 with security implication | + CISO, immediately |
| SEV1 with data loss possibility | + CTO + Legal, immediately |
| SEV1 lasting > 4h | CEO informed |

The point of the table is to remove the awkward "should I wake them up?" decision. The table already decided. If we're at minute 65 and not resolved, the engineering manager gets paged. No discussion.
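
Because the thresholds are fixed, the lookup can be code instead of a judgment call. Here is a sketch of the same table as a function; the role strings and the security/data-loss flags are assumptions about how you'd wire this into your paging tool.

def who_to_page(minutes_elapsed: int, security: bool = False,
                data_loss: bool = False) -> list[str]:
    # Returns the roles to page for an unresolved SEV1, following the table above.
    pages = ["on-call engineer"]                      # Tier 1, always
    if minutes_elapsed >= 30:
        pages += ["service tech lead", "on-call IC"]  # Tier 2
    if minutes_elapsed >= 60:
        pages += ["engineering manager", "VP Eng"]    # Tier 3
    if minutes_elapsed >= 240:
        pages.append("CEO (informed)")
    if security:
        pages.append("CISO")
    if data_loss:
        pages += ["CTO", "Legal"]
    return pages

# At minute 65 of an unresolved SEV1, the table has already decided:
print(who_to_page(65))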

Decision log: why this matters more than people think

Half the value of an incident response template is the decision log. Most timelines just list events: "14:18, failover initiated." That tells you nothing useful in the post-mortem.

A decision log records the call AND the reasoning. "14:18, failover initiated. Disk-full on primary replica confirmed. Held off until this point because we wanted to verify the standby was healthy before cutting over. Risk: 30s of write unavailability during cutover. IC: @aidan."

Three months later, when a similar incident happens, the next IC reads that and skips the same investigation. That's compounding institutional knowledge. Without the log, every incident is a fresh start.
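
If a script or Slack bot appends decisions on the scribe's behalf, a tiny structured record keeps the call, the reasoning, and the risk together. This is a sketch; the class and field names are hypothetical, and the output row simply mirrors the Decision Log table in the template above.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    # One row of the decision log: the call, the reasoning, the risk, the owner.
    decision: str
    rationale: str
    risk: str
    owner: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def as_row(self) -> str:
        # Renders a markdown row in the same shape as the template's table.
        return (f"| {self.at:%H:%M} | {self.decision} | "
                f"{self.rationale} ({self.risk}) | {self.owner} |")

log = [
    Decision(
        decision="Trigger failover to standby replica",
        rationale="Disk-full confirmed, replica won't recover without intervention",
        risk="~30s of write unavailability during cutover",
        owner="IC @aidan",
    ),
]
print("\n".join(entry.as_row() for entry in log))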

This is also why incident response templates only get used if engineers can find them in 10 seconds. A playbook buried in a Confluence page nobody can locate is the same as no playbook. Teams that publish their incident response template, runbooks, and severity definitions to a fast, searchable internal docs site recover noticeably faster; in our experience watching teams onboard, improvements on the order of 5x in mean time to recovery are common. Docsio handles this for SaaS teams that don't want to build the docs site themselves, generating a branded, hosted site from your existing content with full-text search out of the box. The next piece to build after that is the post-mortem template you'll link from each closed incident.

Worked example: SEV1 database read replica failure

Let's run a real one through the template. 14:02 UTC, Tuesday.

14:02. PagerDuty fires. Read replica lag exceeds 30 seconds. Read queries from the API are timing out. Checkout endpoint returns 500 for ~40% of requests.

14:04. On-call engineer Priya acknowledges. Checks dashboards. Confirms replica lag is climbing, not stable. Declares SEV1 in #inc-2026-05-04-replica-lag. Pages Aidan as IC. Sam picks up comms lead.

14:07. Sam posts the "investigating" status page update. IC asks scribe to log timeline. Priya digs into replica logs.

14:09. Decision logged: "Hold off on failover. Primary may still recover. Risk of premature failover is data divergence. IC: Aidan." Team agrees.

14:14. Priya finds it. Replica disk is at 99% full. WAL files have stopped applying. The replica isn't catching up; it's stalling.

14:18. Decision logged: "Failover to standby. Disk-full confirmed, replica won't recover without intervention. Risk: 30s write blip. IC: Aidan." Failover triggered.

14:22. Failover complete. Replica lag drops to zero on standby. API latency back to baseline. Sam posts "identified, fix deploying" status update.

14:26. Five-minute monitoring window clean. Sam posts "resolved" on status page. IC announces resolution in Slack channel.

14:35. IC posts handoff note in the incident doc:

Resolved. Root cause: disk-full on primary replica because retention policy on a new audit-log table wasn't set. WAL files accumulated. Fix: failover to standby (now primary), original primary being rebuilt. Post-mortem scheduled for Thursday 10:00 with @priya owning. Runbook for "read replica disk full" already exists, was used. Recommend updating runbook to include the disk-full alert threshold check.

The whole incident: 24 minutes from first alert to resolved, 33 to the closing handoff note. The decision log shows two key calls and why they were made. The post-mortem on Thursday writes itself, because the timeline and decisions are already captured.

Common mistakes that gut your template

Templates fail in predictable ways.

  • Severity tiers nobody actually uses. If every incident gets declared SEV2 because nobody wants the SEV1 paperwork, the tiers are too punishing. Lighten the SEV1 process or accept that you'll under-report.
  • Comms updates that go out without IC sign-off. A well-meaning support manager posts "we think it's a third-party issue" before the IC has confirmed. Now the comms are wrong and you have to retract. Hard rule: comms lead drafts, IC approves, status page goes live.
  • Decision log treated as optional. Engineers skip it because "we'll remember." They won't. The log has to be the scribe's only job during the incident.
  • No definition of "resolved." Some teams call resolved at the moment the fix deploys. Others wait for a 15-minute clean monitoring window. Pick one. Write it down. Without a definition, status pages get marked resolved while customers are still seeing errors.
  • Template lives in five places. Confluence, Notion, a Google Doc, an old wiki, the README. Engineers can't find it. Pick one home and link to it from PagerDuty alerts.
  • Treating it as a compliance artifact. If the only time anyone opens the template is during an audit, your team isn't using it. The template should feel like a living working file.

Plug it into the rest of your engineering docs

The incident response template doesn't live alone. It connects to the post-mortem template you open after every SEV1, the runbooks it points engineers at mid-incident, and the reference docs the IC pulls from (severity definitions, escalation tree, comms wording).

If you've never written one of these, start with the incident response template. It pays back the fastest. Then add a post-mortem template. Then add runbooks for your top three failure modes. Three documents cover 80% of the value.

Frequently asked questions

What's the difference between an incident response template and an incident response plan?

The plan is the policy: severity definitions, roles, escalation rules, communication standards. The template is the live working document for one specific incident, structured by the plan. You write the plan once, you fill in a template every incident. Most SaaS teams need a one-page plan and a clear template, not a 79-page binder.

Do small SaaS teams actually need this?

Yes, earlier than founders expect. Once you have paying customers and a status page, you have incidents. The first SEV1 without a template is the one that teaches you why it matters: nobody writes anything down, customer comms contradict each other, and the post-mortem is reconstructed from memory. The template is two hours to set up and saves a full day on every future incident.

How is this different from a NIST or CISA incident response template?

NIST and CISA templates target enterprise cybersecurity programs, regulated industries, and breach response. They're heavy on policy, compliance, and forensic preservation. A SaaS incident response template targets engineering operations: severity tiers tied to user impact, fast comms loops, and integration with on-call tooling. Different audiences, different formats. If you have a security incident, a NIST template still applies; for a database outage, the SaaS-engineering format fits better.

Should the on-call engineer also be the incident commander?

For SEV3 and small SEV2 incidents, yes. For SEV1, no. The on-call engineer is the one with their hands on the system. The IC is the coordinator. One brain shouldn't do both during a high-pressure SEV1. Most mature teams rotate IC duty separately from on-call, or have a senior engineer step in as IC when a SEV1 is declared.

How long should an incident response template be?

The active document for one incident is usually one page. The reference docs (severity definitions, escalation tree, comms templates) are another two to three pages combined. If your template is more than five pages of "active fields," you've turned it into a compliance binder. Cut it down.
