The war is over.
Things are getting back to normal. There’s still smoke on the battlefield, but everyone is starting to breathe again.
You just fought a very tough battle to fix something that failed miserably. Whatever it was, a production deployment failure, a security breach or a sales process you could not afford to miss and missed anyway, the crisis is finally over. Still, you’re not done yet. As everyone congratulates each other, or silently leaves the war room to breathe some fresh air after a sleepless night, there’s one thing to do while things are still fresh in your mind.
The post-mortem analysis.
You’re tired and thinking, “I’ll do it tomorrow. First, I need to sleep.” That’s a lie, and you don’t know it yet. I know it from experience. If you don’t do it now, if you don’t share it now, you won’t do it at all.
Often considered the boring part of the job, the post-mortem analysis is as critical as fixing the situation was. It’s a summary of what happened and what was done to fix it. By building a comprehensive knowledge base, it helps the whole company avoid making the same mistake twice and ensures someone will remember it after you’re gone.
In large corporations, post-mortem analyses are made of dozens of complicated documents that need three levels of management to approve. In startups, they must be findable, practical and reusable.
I’m pretty sure ops methodologies have a perfect and complicated template for those, maybe even an ISO standard. I tend to keep them dead simple. I split my post-mortem analyses into 4 parts:
- What happened.
- How it was detected and fixed.
- Why it happened.
- What is the plan to ensure it won’t happen anymore.
What happened is a comprehensive account of the event, describing the whole thing from beginning to end. I try to write it as a story to make it easier to read, adding timestamps when available.
How it was detected and fixed is the operational part. It’s the HOWTO in case the same thing or something similar happens again.
Why it happened is the only part that’s often incomplete when the analysis is first shared. The 5 whys, which help find the root cause of a problem, often take time to answer.
What is the plan to ensure it won’t happen anymore is the real TODO list you’ll have to find time for in the coming days or weeks.
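To make the four parts concrete, here is a minimal sketch of a template that renders them as a Markdown document. The class name, field names and placeholder text are all assumptions made for illustration, not a standard format; the point is only that the "why" section is allowed to start out incomplete while the other three are filled in right away.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostMortem:
    """Four-part post-mortem. All names here are illustrative."""

    title: str
    what_happened: str
    detection_and_fix: str
    # The root cause is often unknown on first publication, so it has a default.
    why: str = "To be completed (root cause still under investigation)."
    action_plan: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the post-mortem as Markdown, one heading per part."""
        plan = "\n".join(f"- [ ] {item}" for item in self.action_plan)
        sections = [
            ("What happened", self.what_happened),
            ("How it was detected and fixed", self.detection_and_fix),
            ("Why it happened", self.why),
            ("What is the plan to ensure it won't happen anymore",
             plan or "- [ ] To be defined."),
        ]
        body = "\n\n".join(f"## {heading}\n\n{text}" for heading, text in sections)
        return f"# {self.title}\n\n{body}\n"


# Placeholder content only; replace with the real story of the incident.
report = PostMortem(
    title="Post-mortem: example incident",
    what_happened="Story of the event, with times when available.",
    detection_and_fix="The HOWTO, in case it happens again.",
    action_plan=["First follow-up task"],
)
print(report.render())
```

Publishing with the default "why" and editing it later fits the idea of sharing the analysis immediately rather than waiting for the root cause.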
Write it on the medium you prefer, except email. Email is terrible at keeping information alive in the company. It can be your enterprise social network if you have one, a wiki, or your favorite bug tracker.
I like using GitHub issues for that and giving them a postmortem label, because other people can comment after I publish and I can close them when we consider there’s nothing to add.
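If you want to script that step, a rough sketch could look like the following. It builds a request for GitHub's REST endpoint for creating an issue (`POST /repos/{owner}/{repo}/issues`) with a `postmortem` label, using only the Python standard library. The owner, repository and token values are placeholders, the function name is made up for this example, and the request is deliberately not sent.

```python
import json
import urllib.request


def build_postmortem_issue(owner: str, repo: str, token: str,
                           title: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a request that opens a labeled GitHub issue.

    The function name and the "postmortem" label are illustrative choices.
    """
    payload = {"title": title, "body": body, "labels": ["postmortem"]}
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; nothing is sent over the network here.
request = build_postmortem_issue(
    owner="example-org",
    repo="example-repo",
    token="YOUR_TOKEN",
    title="Post-mortem: example incident",
    body="## What happened\n\n(story goes here)",
)
# To actually publish: urllib.request.urlopen(request)
```

As far as I know, GitHub only applies labels when the author has push access to the repository, so check that the label shows up on the first issue you create this way.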
I also sometimes blog about the most important ones when I know they can be of public interest, even though they show how I failed at some point.
Failing is part of the game. That’s why post-mortem analyses matter: they help you avoid making the same mistake twice.