A story about blameless post-mortems
Chicka Chicka Boom Boom
--
A told B and B told C, “I’ll meet you at the top of the coconut tree.” “Whee!” said D to E F G, “I’ll beat you to the top of the coconut tree.” Chicka chicka boom boom! Will there be enough room?
You just launched a Hot new product or feature. People are Interested, and you Just got a call from Kyle in Legal. A digital Media News Outlet wants to Publish a story, and Kyle wants to make sure you’re Qualified to take the interview. Your marketing team just Released a stellar Strategy and it’s Taking off like a rocket. Usage is up and the team is feeling Victorious.
Still more — W! and X Y Z! The whole alphabet up the — Oh, no! Chicka Chicka… BOOM! BOOM!
Your servers can’t handle the load. Pagers are going off everywhere. Customer success is rattled, the executives want to know when the system will be back online, and finance is counting every penny of lost revenue while you get the incident under control. How will you and your team react?
Here are a few tips from Chicka Chicka Boom Boom, a children’s book by Bill Martin Jr. and John Archambault, illustrated by Lois Ehlert.
Skit skat skoodle doot. Flip flop flee.
The mama and papa letters in Chicka Chicka Boom Boom don’t panic. They move urgently but confidently to help their little dears. Having plans and procedures in place before a big launch allows you to respond quickly and urgently, while still maintaining a semblance of business as usual.
As part of your launch planning process, invite a cross-functional team to imagine all the reasons your project could turn into a miserable failure. Then figure out how you can prevent those problems now, while there’s still time. This practice is called a pre-mortem exercise, and there are lots of templates for how to run one effectively.
Identify what monitoring you’ll need in place to have an early warning system before things go BOOM! Set up PagerDuty to promptly notify key individuals on the team if they do, and maintain playbooks with step-by-step instructions so anyone can spin up more servers, reboot a database, shut down a security threat, or perform other common troubleshooting actions.
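If it helps to picture what an early warning system might look like, here is a minimal sketch in Python. It assumes two levels per metric, a quieter warning and a louder page. The names `Threshold`, `evaluate`, and the `page` callback are made up for illustration; in practice `page` would be whatever hook your alerting tool provides, and this is not PagerDuty's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Threshold:
    metric: str     # hypothetical metric name, e.g. "cpu_percent"
    warn_at: float  # early-warning level: worth a look, nobody gets woken up
    page_at: float  # BOOM level: someone gets paged


def evaluate(readings: Dict[str, float],
             thresholds: List[Threshold],
             page: Callable[[str], None]) -> List[str]:
    """Return early-warning messages and call page() for anything past page_at."""
    warnings = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is None:
            continue  # no reading for this metric in this sample
        if value >= t.page_at:
            page(f"{t.metric} at {value} (page threshold {t.page_at})")
        elif value >= t.warn_at:
            warnings.append(f"{t.metric} at {value} (warn threshold {t.warn_at})")
    return warnings


if __name__ == "__main__":
    # Stand-in for a real paging hook: just print the message.
    for message in evaluate(
        {"cpu_percent": 85, "error_rate": 0.2},
        [Threshold("cpu_percent", 70, 90), Threshold("error_rate", 0.05, 0.1)],
        page=print,
    ):
        print("warning:", message)
```

The point of the two levels is that the warning gives the mamas and papas a head start, while the page is reserved for the moment the letters actually hit the ground.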
Get everybody running to the coconut tree
Mamas and papas and uncles and aunts hug their little dears, then dust their pants.
Gather a cross-functional team as soon as an incident is declared and give them one centralized place to communicate, whether that’s a virtual meeting, a public chat channel, or a physical war room. Make it an expectation that representatives from Operations, Engineering, and QA attend, but also ensure Product, Customer Success, and Marketing will come running. Radiate information broadly, and lean on your business partners to help craft the messaging your customers and executives need to hear about the progress you’re making to fix the issue.
Status pages like Atlassian’s and GitHub’s are fantastic examples of how to keep your customers up to date in a scalable way. It’s even better if you can proactively inform people of known outages as they contact customer support, potentially using a tool like Intercom or in-app messaging.
Keep it blameless
“Help us up,” cried A B C.
The first letters in Chicka Chicka Boom Boom to ask for the help they need are A, B, and C, who were also, coincidentally, the first letters up the tree. One of the most powerful teaching moments of this children’s story is actually found in what’s NOT on the page. Despite A, B, and C encouraging everyone to follow them in an unsuccessful endeavor, none of the other letters stop to point fingers or blame those first three for their bruises, loose teeth, and skinned knees. Instead, the whole alphabet comes together to dust everyone off so life can go back to normal.
In organizations that embrace DevOps culture, this practice is known as a Blameless Post-mortem or Incident Review. The most popular guide on how to run this kind of review comes from Etsy’s Code as Craft blog.
Having a truly blameless post-mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
- what actions they took at what time,
- what effects they observed,
- expectations they had,
- assumptions they had made,
- and their understanding of timeline of events as they occurred.
…and that they can give this detailed account without fear of punishment or retribution.
Those are the wise words of John Allspaw and the team at Etsy, not mine, but they and many others in the community advocate for coming together as a cross-functional team to create a timeline of what happened, prevent future incidents, and identify areas to improve the response next time. This practice builds a just culture and a learning atmosphere, where some mistakes are expected (we are all human, after all) and tolerated as long as we take the time to learn and improve along the way.
You may think that blameless post-mortems are only useful to identify technical solutions. However, if your post-mortems are only uncovering missing monitoring or ways to NOT write code, you may need to dig deeper and expand your reach more broadly. Engineers and Ops folks don’t work in a bubble, and it should always be assumed they did the best they could with the information they were given and the time, tool, or budget constraints they were under. What missed requirements or poor communication led to the failure to put in that monitoring in the first place, and how can those communication channels be established? After the incident began, how can we more quickly assemble our team and get relevant information out to customers faster? Asking these questions can help improve communication and processes across the entire organization.
Last to come, X Y Z. And the sun goes down on the coconut tree…
In addition to being a great way to uncover opportunities for improvement within the organization, blameless post-mortems also provide us with a clear definition of done when resolving an incident. When all of the post-mortem action items have been resolved, we can let the sun set on that incident and focus on the new day.
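For teams that track incidents in code or a spreadsheet, that definition of done can be written down in a few lines. The sketch below is only an illustration: `Incident` and `ActionItem` are hypothetical names, and it assumes an incident with no action items has not had its post-mortem yet, so it does not count as done.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    resolved: bool = False


@dataclass
class Incident:
    title: str
    action_items: List[ActionItem] = field(default_factory=list)

    def open_items(self) -> List[ActionItem]:
        """Action items from the post-mortem that still need work."""
        return [item for item in self.action_items if not item.resolved]

    def is_done(self) -> bool:
        # Assumption: no action items means the review has not happened yet,
        # so the sun has not set on this incident.
        return bool(self.action_items) and not self.open_items()


if __name__ == "__main__":
    incident = Incident("Launch-day outage", [
        ActionItem("Add load alerts", resolved=True),
        ActionItem("Write a database reboot playbook"),
    ])
    print(incident.is_done())  # still one open item
    incident.action_items[1].resolved = True
    print(incident.is_done())  # every item resolved
```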
Don’t be afraid to try again
A is out of bed, and this is what he said, “Dare double dare, you can’t catch me. I’ll beat you to the top of the coconut tree.”
Chicka Chicka Boom Boom ends with a brave little A racing right back up the same coconut tree. Incidents will happen. Servers will go down. Third-party integrations will fail. No one is immune to system outages, but we can’t let the fear of a subsequent outage keep us from ambitiously charging right back up the tree. By creating a learning culture that prepares for incidents, rallies together, and embraces blameless post-mortems, you too can have the confidence to lead your team up the coconut tree again.
Note: I am not affiliated with the authors, illustrator, or publisher of Chicka Chicka Boom Boom. I’m just a toddler mom who’s interested in presenting Modern Agile principles and best practices in new, engaging ways.