
Post Mortem Template: Sections, Examples, and How to Use It

post-mortem, incident-response, engineering-process, documentation-template

A good post mortem template forces a team to slow down and answer the questions that matter after an outage: what broke, who was affected, why it happened, and what changes by Friday. A bad one collects screenshots, blames a junior engineer, and gets filed somewhere nobody looks again. The template you use shapes which one you write, so the template is worth getting right. Pair this with the runbook template we published earlier, and you have the two documents most engineering teams actually need: the runbook for the incident, the post mortem for the lessons.

This post gives you a copy-and-paste post mortem template, walks through every section, and shows two worked examples from real-shaped outages. We also cover the difference between a post mortem, a root cause analysis (RCA), and an incident report, because those terms get used interchangeably and they shouldn't be.

What a post mortem template is for

A post mortem template is a structured document teams fill out after an incident to capture what happened, what was learned, and what will change as a result. It is not a status update. It is not a blame document. It is a permanent record that lets the next on-call engineer understand the failure mode without having to ping anyone on Slack.

The template matters because incident memory decays fast. A week after the outage, the timeline is fuzzy. A month later, the log links are dead. A quarter later, half the responders have rotated off the team. A consistent post mortem template captures the details while they are still fresh, in a format the next reader can scan in two minutes.

Manual post-mortem reconstruction wastes 60-90 minutes per incident as teams scroll through Slack history, monitoring tools, and call recordings (incident.io, 2025). A template won't eliminate that work, but it does make sure the work produces something useful at the end.

Post mortem vs RCA vs incident report

These three terms get blurred together. They are different documents with different audiences.

| Document | Written for | Length | Focus |
|----------|-------------|--------|-------|
| Incident report | Customers, status page readers, account managers | 1-3 paragraphs | What happened, what we did about it, apology |
| Root cause analysis (RCA) | Engineering leadership, auditors, compliance | 1-2 pages | The technical chain of causation, the fix |
| Post mortem | The engineering org, future on-call engineers | 1-3 pages | Everything: timeline, contributors, mitigators, action items, lessons |

The post mortem is the most complete of the three. The incident report is what you publish externally. The RCA is what you attach to a JIRA ticket. The post mortem is the document the team learns from and the one that gets re-read when a similar pattern shows up six months later. If you only write one of these, write the post mortem.

A separate doc you may also need is the architecture decision record, which captures pre-decision reasoning. Post mortems are post-incident; ADRs are pre-decision. Both belong in the same shared docs system but answer different questions.

The post mortem template

Copy this directly into your docs system and edit per incident. Sections are ordered so the most important context (summary, impact) lands first.

# Post-mortem: [Short incident title]

**Status:** Draft / Final
**Author:** [Name]
**Reviewers:** [Names]
**Incident date:** YYYY-MM-DD
**Severity:** SEV-1 / SEV-2 / SEV-3
**Duration:** [HH:MM, UTC start to UTC end]
**Services affected:** [List]

## Summary

Two or three sentences. What broke, who was affected, how long it lasted,
how it was resolved. Written so a stakeholder outside the team can read it
in 30 seconds and understand the incident.

## Impact

- **Users affected:** [Number / percentage]
- **Customer-visible symptoms:** [What users saw]
- **Revenue impact:** [If known, in dollars or transactions]
- **Support load:** [Tickets opened, social mentions]
- **SLA / SLO breach:** [Yes / No, which]

## Timeline

All times in UTC. Include the lead-up events, detection, response,
and recovery steps. Link to Slack threads, dashboards, and PRs.

| Time (UTC) | Event |
|------------|-------|
| HH:MM | [Trigger event, e.g. deploy of v2.3.1] |
| HH:MM | [First alert fires] |
| HH:MM | [On-call engineer paged] |
| HH:MM | [Incident declared, war room opened] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Service restored, incident closed] |

## Root cause

The technical reason this incident happened. One paragraph. Use the
five-whys technique to walk from the symptom back to the underlying cause.

## Contributing factors

Things that made the incident worse or more likely. Examples: missing
alerting, stale runbook, on-call rotation gap, unreviewed config change.
List 3-7 of these. Not every contributor is a "root cause" but each one
is something to consider improving.

## What went well

What the team did right during the response. Often skipped, always
worth filling in. Examples: alert fired within 60 seconds, mitigation
was a clean rollback, incident channel was used correctly.

## What didn't go well

What slowed the response or made things worse. Examples: dashboard
was paging the wrong team, no runbook for this service, alert
threshold was too noisy and got ignored.

## Action items

Each item gets an owner and a due date. Tracked in your ticketing
system. Stale action items mean the post mortem didn't take.

| Action | Owner | Due | Ticket |
|--------|-------|-----|--------|
| [Specific change to ship] | [Name] | YYYY-MM-DD | [Link] |

## Lessons learned

What the team takes forward beyond the action items. Patterns,
risks, gaps in knowledge or tooling. This is the section future
readers re-read when a similar failure happens.

Walking through each section

The template is short on purpose. Every field that ships empty is one less reason to read the document later. Here is what each section is doing.

Summary

Stakeholders outside engineering will read only the summary. Write it like a journalist's lede. Pattern: "On 2026-04-22, between 14:12 and 14:59 UTC, our checkout service returned 500 errors for approximately 38% of requests due to database connection pool exhaustion triggered by a deploy at 13:58 UTC. The deploy was rolled back; no data was lost." Two sentences that name what, when, how big, why, and how it ended.

Impact

Specific numbers beat vague claims. "38% of requests failed for 47 minutes" lands harder than "most users had problems for almost an hour." If revenue impact is calculable, calculate it. If you can't tell yet, say so and update the doc when the number is in. Specificity is what makes the post mortem useful for prioritization meetings later.
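A purely hypothetical worked number, just to show the shape of the calculation: if checkout normally handles about 40 orders per minute at an average order value of $55, a 38% failure rate over 47 minutes is roughly 40 × 0.38 × 47 ≈ 714 failed checkouts, on the order of $39,000 in orders at risk. Even a back-of-the-envelope figure like that carries more weight in a prioritization meeting than "revenue was affected."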

Timeline

UTC, always. Mixing local time zones is how a timeline stops lining up with what the European on-call remembers. The timeline should include events the response team didn't see in real time but pieced together afterward, like "the config change that introduced the latent bug shipped 9 days before the incident." Link every row to the Slack message, dashboard, or PR that proves it. The timeline is the part of the post mortem that gets cited when a similar incident happens later, so accuracy and links matter more than prose.

Root cause and contributing factors

Use the five-whys to get from symptom to root cause. The Atlassian incident handbook frames this well: each "why" should reveal a deeper layer until you hit something structural. Example chain:

  1. The application returned 500s because the database connection pool was exhausted.
  2. The pool was exhausted because connections weren't being released after the new ORM call.
  3. The ORM call wasn't releasing because the new release wrapped queries in a context that bypassed the existing release logic.
  4. The bypass wasn't caught because integration tests don't measure connection pool state.
  5. We don't measure connection pool state in tests because we never had a connection-leak incident before.

That chain points at three different fixes (the ORM bug, the test gap, the metric gap), not just the obvious one. Listing contributing factors separately keeps the post mortem from collapsing onto a single root cause that is too convenient to be true.
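To make the test-gap fix from that chain concrete, here is a minimal sketch of an integration test that watches connection pool state. It assumes a SQLAlchemy-style engine (whose queue pool exposes a checkedout() count) and a hypothetical handle_checkout_request function standing in for the code under test; swap in whatever your stack actually exposes.

```python
# Sketch only: assert that a request handler releases its DB connections.
# Assumes a SQLAlchemy engine; handle_checkout_request is a stand-in for the
# code path that leaked connections, not a real function from this post.
import sqlalchemy


def test_checkout_releases_connections():
    engine = sqlalchemy.create_engine(
        "postgresql://app:app@localhost/app_test", pool_size=5, max_overflow=0
    )
    baseline = engine.pool.checkedout()  # connections held before the request

    handle_checkout_request(engine)      # exercise the path that leaked in prod

    # If a new ORM wrapper bypasses the release logic, this fails in CI
    # instead of during a failover in production.
    assert engine.pool.checkedout() == baseline
```

The assertion is deliberately blunt: it doesn't care why a connection is still checked out, only that the handler gave back everything it borrowed.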

What went well, what didn't

The "went well" section gets skipped when teams are tired or embarrassed. Don't skip it. The point isn't to soften the tone. It is to surface what to keep doing. If your alert fired in 30 seconds and your incident commander assignment was clean, that is a system to defend, not a footnote. The "didn't go well" section is the honest version of the same exercise. If the runbook was missing, say so. If the dashboard paged the wrong team, say that too.

Action items

Action items are the part that decides whether the post mortem was worth writing. Each item needs an owner, a due date, and a ticket. Items without those fields get dropped. Items without due dates slip indefinitely. A post mortem that closes with five action items, of which three are still open six months later, is a sign the post mortem ritual has stopped working.

Cap action items at 5-7 per incident. More than that and none get done. If you find yourself with 12 follow-ups, group them into themes and pick the top ones. The rest go into the "lessons learned" section as known risks, not commitments.

Lessons learned

This is the section future engineers read. Action items are this incident's punch list; lessons learned are the patterns. "Latent config changes that don't show effect for a week are hard to attribute" is a lesson. "We need to add connection pool metrics" is an action item. Both belong in the doc, but they belong in different sections.

Worked example 1: cache stampede took down checkout

Summary. On 2026-03-14, between 09:42 and 10:29 UTC, checkout returned errors for 91% of requests for 47 minutes due to a cache stampede when the Redis cluster failed over. A configuration change shipped 11 days earlier had reduced the cache TTL from 600s to 60s, increasing miss rate. When Redis failed over, every request hit the database directly. The fix was scaling the database read replicas; the underlying TTL change was reverted.

Root cause. Cache TTL was reduced as part of unrelated work on a different feature, and the change wasn't flagged as having capacity implications. When Redis failed over, the database couldn't absorb the spike.

Contributing factors.

  • Cache TTL changes don't go through capacity review.
  • Database read replica autoscaling has a 4-minute warmup, too slow to absorb the spike before the database saturated.
  • No alert on cache hit ratio, only on Redis health.
  • The runbook for "Redis failover" assumes cache TTL is 10 minutes.

Action items.

  • Add cache hit ratio to the Redis dashboard. Owner: Priya. Due: 2026-03-21.
  • Add a check in the deploy pipeline for TTL changes on hot keys. Owner: Marco. Due: 2026-03-28.
  • Update the Redis failover runbook to include database read replica scaling. Owner: Kim. Due: 2026-03-22.

Lesson learned. Cache TTL is a capacity decision dressed up as a config tweak. Treat it accordingly.
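The second action item on that list, flagging TTL changes on hot keys in the deploy pipeline, can start as a small CI script that diffs the cache config. A rough sketch, assuming TTLs live in a cache_config.json in the repo and hot keys are listed in a hot_keys.txt; both file names are invented for illustration, not taken from the incident above.

```python
# Hypothetical CI check: fail the build when a hot key's TTL is reduced.
# cache_config.json and hot_keys.txt are made-up names for this sketch.
import json
import subprocess
import sys


def load_ttls(ref: str) -> dict:
    """Read the TTL config as it exists at a given git ref."""
    raw = subprocess.run(
        ["git", "show", f"{ref}:cache_config.json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(raw)  # e.g. {"checkout:cart": 600, ...}


hot_keys = set(open("hot_keys.txt").read().split())
old, new = load_ttls("origin/main"), load_ttls("HEAD")

reduced = [k for k in hot_keys if new.get(k, 0) < old.get(k, 0)]
if reduced:
    print(f"TTL reduced on hot keys, needs capacity review: {reduced}")
    sys.exit(1)
```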

Worked example 2: a migration ran twice

Summary. On 2026-04-08, a database migration ran twice in production, leaving 18,000 user records with duplicate billing_address rows. No customer was charged twice, but support handled 142 tickets over 36 hours. The duplicates were reconciled with a backfill script.

Root cause. The migration tool retried the migration after a 30-second timeout, but the first run had succeeded just past the timeout. There was no idempotency check.

Contributing factors.

  • The migration tool's retry behavior wasn't documented in the deploy runbook.
  • No idempotency convention for schema migrations.
  • The migration was run during a deploy with three other migrations, masking the duplicate runs in logs.
  • Detection was reactive, via support tickets, not proactive via monitoring.

Action items.

  • All migrations require an idempotency guard (IF NOT EXISTS or equivalent). Owner: data-platform team. Due: 2026-04-22.
  • Disable migration retries in the deploy tool, fail loudly instead. Owner: Lin. Due: 2026-04-15.
  • Add a row-count assertion at the end of every migration. Owner: data-platform team. Due: 2026-04-29.

Lesson learned. Retries on irreversible operations are a class of bug, not a single bug. We should audit other tools that retry.
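To make the first and third action items concrete, here is a minimal sketch of what an idempotency guard plus a row-count assertion might look like, assuming Postgres-style SQL and a plain DB-API connection; the table and column names are invented for illustration and are not the schema from the incident.

```python
# Sketch of a migration that is safe to run twice: DDL guarded with
# IF NOT EXISTS, backfill guarded with NOT EXISTS, and a row-count
# assertion so a duplicate run fails loudly instead of doubling data.
def migrate(conn):
    with conn.cursor() as cur:
        # DDL guard: a retried run is a no-op instead of an error.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS billing_addresses (
                user_id bigint REFERENCES users (id),
                line1   text,
                city    text
            )
        """)
        # Data guard: only backfill users that have not been copied yet.
        cur.execute("""
            INSERT INTO billing_addresses (user_id, line1, city)
            SELECT id, address_line1, address_city
            FROM users u
            WHERE NOT EXISTS (
                SELECT 1 FROM billing_addresses b WHERE b.user_id = u.id
            )
        """)
        # Row-count assertion: at most one billing address per user.
        cur.execute("SELECT count(*) FROM billing_addresses")
        addresses = cur.fetchone()[0]
        cur.execute("SELECT count(*) FROM users")
        users = cur.fetchone()[0]
        assert addresses <= users, "duplicate billing_address rows detected"
    conn.commit()
```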

Blameless culture, briefly

The phrase "blameless post mortem" gets quoted often and applied unevenly. The intent is straightforward: the goal of a post mortem is to learn what failed in the system, not to identify who to fire. Naming individuals is fine when it adds context ("the on-call engineer was paged at 02:14 and acknowledged at 02:17"). Implying the engineer caused the outage by being on call is not.

Two practical rules. First, replace "Engineer X caused the outage by deploying a bad config" with "A config change deployed at 13:58 introduced the bug. The change passed review and CI." The second framing is true and complete; the first one is editorial. Second, if a single person's action triggered the incident, the post mortem should ask why the system allowed that action without a guardrail. The answer is almost always "because we hadn't built the guardrail yet," and that is what the action items address.

Blameless doesn't mean responsibility-free. Action items have owners. People are accountable for fixing what broke. The blameless part is that the post mortem isn't where accountability gets assigned for the original incident; the action items are where accountability gets assigned for what changes next.

Where post mortems live matters

A post mortem that nobody can find six months later is a journal entry. A post mortem that the next on-call engineer can search for when they see a similar alert is institutional memory.

Most teams default to Notion or Google Docs for post mortems, then watch them disappear into folder hierarchies. The fix is to put them in a searchable, linked-up internal docs site, alongside runbooks, ADRs, and the SOPs the team already uses. That way the post mortem from March 2026 shows up when an engineer in October searches for "Redis failover" or "cache stampede." Tools like Docsio can host this kind of internal docs archive without your team having to build the site.

The point isn't the specific tool. The point is that post mortems compound when they are searchable, cross-linked, and read. If your team's last 12 post mortems are scattered across three tools and two folder structures, you are paying the documentation tax without earning the institutional memory.

For broader guidance on how internal docs should be organized, our piece on internal documentation covers the structure most engineering teams settle on.

Common post mortem mistakes

A few patterns make post mortems worse than not writing them at all.

  1. Writing it three weeks later. Memory has decayed. Slack history is harder to reconstruct. The on-call engineer has rotated off. Post mortems should be drafted within 5 business days. PagerDuty's standard template names this explicitly.
  2. Listing 20 action items. None of them get done. Cap at 5-7 and put the rest in lessons learned.
  3. Treating "human error" as a root cause. Human error is always upstream of a missing guardrail. Keep asking why until you find the system gap.
  4. Skipping the "what went well" section. You will lose the patterns worth defending.
  5. Not linking the timeline to evidence. A timeline without links to dashboards and Slack threads is a story, not a record.
  6. Filing the doc in a place no one searches. Same as not writing it.

The shape of the document also matters less than whether your team actually fills it out after every SEV-2 or higher. A simple consistent template that gets used beats a sophisticated one that gets skipped. If yours feels too heavy, cut sections. If too light, add them. Then leave it alone for a quarter.

For a similar template aimed at the operational side of the same workflow, see our runbook template. Runbooks tell you what to do when a known failure happens; post mortems are where you decide whether the failure should still be possible at all. Together with release notes and changelogs, they form the operational documentation backbone most engineering teams rely on, regardless of whether they call it that.

FAQ

What sections should a post mortem template include?

A useful post mortem template covers summary, impact, timeline, root cause, contributing factors, what went well, what didn't go well, action items, and lessons learned. Some teams add a responders section with names of the incident commander and scribe. Skip sections that consistently come back empty for your team rather than padding them.

How long should a post mortem be?

Most post mortems land between 1 and 3 pages, or 600-1,500 words. Severity drives length. A SEV-3 with a clean rollback might be half a page; a SEV-1 customer-visible outage with multiple contributing factors needs more. Length should match what is genuinely worth recording, not a fixed page count.

Who should write the post mortem?

Usually the incident commander or the on-call engineer who led the response. The author drafts within 5 business days, then circulates for review with everyone who responded. Reviewers add corrections and missing context. The post mortem is finalized once action items have owners and due dates.

What is the difference between a post mortem and a retrospective?

A post mortem is incident-specific and triggered by a failure. A retrospective is recurring and team-wide, usually run at the end of a sprint or project. Both produce action items, but post mortems are reactive and narrow, while retrospectives are scheduled and broad. Some teams use the words interchangeably; in engineering, post mortem usually means an incident review.

Should post mortems be public?

Customer-facing incidents often warrant a short public incident report on the status page, but the full internal post mortem usually stays internal. Some teams publish blameless versions externally as a trust signal. The internal version is more candid, includes names and Slack links, and is meant for the engineering org rather than customers.
