Introduction

In the summer of 2011 I began my own personal journey into learning from failure. As the Director of Tech Support for a startup in Boulder, Colorado, I managed all inbound support requests regarding the service we were building: a platform to provision, configure, and manage cloud infrastructure and software.

One afternoon I received a support request to assist a customer who wished to move their instance of Open CRM from Amazon Web Services (AWS) to a newer cloud provider known as Green Cloud, whose infrastructure as a service was powered by “green” technologies such as solar, wind, and hydro. At that time, running an instance of similar size on Green Cloud was also significantly more cost-effective.

Transferring applications and data between cloud providers was one of the core selling points of our service, with only a few clicks required to back up data and migrate to a different provider. However, occasionally we would receive support requests when customers didn’t feel like they had the technical skills or confidence to make the move on their own. In this case, we established a date and time to execute the transition that would have the lowest impact on the customer’s users. This turned out to be 10 p.m. for me.

Having performed this exact action many times over, I assured the customer that the process was very simple and that everything should be completed in under 30 minutes. I also let them know that I would verify that the admin login worked and that the MySQL databases were populated to confirm everything worked as expected.

Once the transfer was complete, I checked all of the relevant logs; I connected to the instance via SSH and stepped through my checklist of things to verify before contacting the customer and closing out the support ticket. Everything went exactly as expected. The admin login worked, data existed in the MySQL tables, and the URL was accessible.
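
For readers curious what that checklist amounts to in practice, here is a minimal sketch of a post-migration verification script. The URL, hostnames, credentials, and database name are hypothetical, and the checks I performed that night were manual rather than scripted; this is only an illustration of the kinds of checks involved, not the tooling we actually used.

    # Hypothetical post-migration verification sketch (not the actual tooling used).
    # Checks: the application URL responds, the admin login page loads, and the
    # restored MySQL tables contain data.

    import pymysql    # assumed MySQL client library
    import requests   # assumed HTTP client library

    APP_URL = "https://crm.example.com"      # hypothetical customer URL
    ADMIN_URL = APP_URL + "/admin/login"
    DB_HOST = "db.example.com"               # hypothetical database host
    DB_USER = "verify"
    DB_PASSWORD = "redacted"
    DB_NAME = "opencrm"

    def check_url(url: str) -> bool:
        """Confirm the URL is reachable and returns a successful status."""
        response = requests.get(url, timeout=10)
        return response.ok

    def check_tables_populated() -> bool:
        """Confirm every table in the restored database contains at least one row."""
        connection = pymysql.connect(host=DB_HOST, user=DB_USER,
                                     password=DB_PASSWORD, database=DB_NAME)
        try:
            with connection.cursor() as cursor:
                cursor.execute("SHOW TABLES")
                tables = [row[0] for row in cursor.fetchall()]
                for table in tables:
                    cursor.execute("SELECT COUNT(*) FROM `%s`" % table)
                    if cursor.fetchone()[0] == 0:
                        print("Table %s is empty" % table)
                        return False
        finally:
            connection.close()
        return True

    if __name__ == "__main__":
        assert check_url(APP_URL), "application URL is not reachable"
        assert check_url(ADMIN_URL), "admin login page is not reachable"
        assert check_tables_populated(), "one or more tables are empty"
        print("All post-migration checks passed")

Note that nothing on this list verifies how recent the restored data actually is.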

When I reached out to the customer, I let them know everything had gone smoothly. In fact, the backup and restore took less time than I had expected. Recent changes to the process had shortened the average maintenance window considerably. I included my personal phone number in my outreach to the customer so that they could contact me if they encountered any problems, especially since they would be logging in to use the system several hours earlier than I’d be back online—they were located in Eastern Europe, so would likely be using it within the next few hours.

Incident Detection

Within an hour my phone began blowing up. First it was an email notification (that I slept through). Then it was a series of push notifications tied to our ticketing system, followed almost immediately by an SMS from the customer. There was a problem.

After a few back-and-forth messages in the middle of the night from my mobile phone, I jumped out of bed to grab my laptop and begin investigating further. It turned out that while everything looked like it had worked as expected, the truth was that nearly a month’s worth of data was missing. The customer could log in and there was data, but it wasn’t up to date.

Incident Response

At this point I reached out to members of my team with more experience and knowledge of the system. Customer data was missing, and we needed to recover and restore it as quickly as possible, if that was possible at all. All the ops engineers were paged, and we began sifting through logs and data, looking for ways to restore the customer’s data as well as to begin to understand what had gone wrong.

Incident Remediation

Very quickly we made the horrifying discovery that the backup data used in the migration was several months out of date. The migration process relies on backup files that are generated every 24 hours by default (users can configure them to run much more frequently). We also found that, for some reason, the customer’s current data had not been backed up during those months. That helped to explain why the migration contained only old data. Ultimately, we had to conclude that the current data was completely gone and impossible to retrieve.
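
In hindsight, a guard as simple as checking the age of the newest backup file before starting a migration would have surfaced the stale data before anything was moved. The following is only a minimal sketch of such a check under assumed paths, file naming, and thresholds; it is not part of the actual migration process described here.

    # Hypothetical pre-migration guard (not part of the actual process described
    # above): refuse to migrate if the newest backup file is older than expected.

    import time
    from pathlib import Path

    BACKUP_DIR = Path("/var/backups/opencrm")   # hypothetical backup location
    MAX_AGE_HOURS = 24                          # matches the default backup interval

    def newest_backup_age_hours(backup_dir: Path) -> float:
        """Return the age in hours of the most recently modified backup file."""
        backups = list(backup_dir.glob("*.sql.gz"))
        if not backups:
            raise RuntimeError("no backup files found at all")
        newest = max(backups, key=lambda p: p.stat().st_mtime)
        return (time.time() - newest.stat().st_mtime) / 3600

    if __name__ == "__main__":
        age = newest_backup_age_hours(BACKUP_DIR)
        if age > MAX_AGE_HOURS:
            raise SystemExit(
                "Newest backup is %.1f hours old; expected one within %d hours. "
                "Aborting migration." % (age, MAX_AGE_HOURS)
            )
        print("Newest backup is %.1f hours old; safe to proceed." % age)

A check along these lines would have flagged a months-old backup immediately, rather than letting the migration proceed on stale data.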

Collecting my thoughts on how I was going to explain this to the customer was terrifying. Being directly responsible for losing months’ worth of the data that a customer relies on for their own business is a tough pill to swallow. When you’ve been in IT long enough, you learn to accept failures, data loss, and unexplainable anomalies. The stakes are raised when it impacts someone else and their livelihood. We all know it could happen, but hope it won’t.

We offered an explanation of everything we knew regarding what had happened, financial compensation, and the sincerest of apologies. Thankfully the customer understood that mistakes happen and that we had done the best we could to restore their data. With any new technology there is an inherent risk to being an early adopter, and this specific customer understood that. Accidents like this are part of the trade-off for relying on emerging technology and services like those our little tech startup had built.

After many hours of investigation, discussion, and back and forth with the customer, it was time to head in to the office. I hadn’t slept for longer than an hour before everything transpired. The result of my actions was by all accounts a “worst-case scenario.” I had only been with the company for a couple of months. The probability of being fired seemed high.

Incident Analysis

Once all of the engineers, the Ops team, the Product team, and our VP of Customer Development had arrived, our CEO came to me and said, “Let’s talk about last night.” Anxiously, I joined him and the others in the middle of our office, huddled together in a circle. I was then prompted with, “Tell us what happened.”

I began describing everything that had taken place, including when the customer requested the migration, what time I began the process, when I was done, when I reached out to them, and when they let me know about the data loss. We started painting a picture of what had happened throughout the night in a mental timeline.

To cover my butt as much as possible, I was sure to include extra assurances that I had reviewed all logs after the process and verified that every step on my migration checklist was followed, and that there were never any indications of a problem. In fact, I was surprised by how quickly the whole process went.

Having performed migrations many times before, I had a pretty good idea of how long something like this should take given the size of the MySQL data. In my head, it should have taken about 30 minutes to complete. It actually only took about 10 minutes. I mentioned that I was surprised by that but knew that we had recently rolled out a few changes to the backup and restore process, so I attributed the speediness of the migration to this new feature.

I continued to let them know what time I reached out to the Ops team. Although time wasn’t necessarily a huge pressure, finding the current data and getting it restored was starting to stretch my knowledge of the system. Not only was I relatively new to the team, but much about the system—how it works, where to find data, and more—wasn’t generally shared outside the Engineering team.

Most of the system was architected by only a couple of people. They didn’t intentionally hoard information, but they certainly didn’t have time to document or explain every detail of the system, including where to look for problems and how to access all of it.

As I continued describing what had happened, my teammates started speaking up and adding more to the story. By this point in our mental timeline we each were digging around in separate areas of the system, searching for answers to support our theories regarding what had happened and how the system behaved. We had begun to divide and conquer with frequent check-ins over G-chat to gain a larger understanding about the situation from each other.

I was asked how the conversation went when I reached out to the customer. We discussed how many additional customers might be affected by this, and how to reach out to them to inform them of a possible bug in the migration process.

Several suggestions were thrown out to the Operations team about detecting something like this sooner. The engineers discussed adding new logging or monitoring mechanisms. The Product team suggested pausing the current sprint release so that we could prioritize this new work right away. Everyone, including the CEO, saw this as a learning opportunity, and we all walked away knowing more about:

  • How the system actually worked

  • What problems existed that we were previously unaware of

  • What work needed to be prioritized

In fact, we all learned quite a bit about what was really going on in our system. We also gained a much clearer picture of how we would respond to something like this. Being a small team, contacting each other and collaborating on the problem was just like any other day at the office. We each knew one another’s cell phone numbers, email addresses, and G-chat handles. Still, we discovered that in situations like this someone from the Ops team should be pulled in right away, at least until access could be provided to more of the team and accurate documentation was made available to everyone. We were lucky that we could coordinate and reach each other quickly to get to the bottom of the problem.

As we concluded discussing what we had learned and what action items we had as takeaways, everyone turned and headed back to their desks. It wasn’t until that moment that I realized I had never once been accused of anything. No one seemed agitated with me for the decisions I’d made and the actions I took. There was no blaming, shaming, or general animosity toward me. In fact, I felt an immense amount of empathy and care from my teammates. It was as though everyone recognized that they likely would have done the exact same thing I had.

Incident Readiness

The system was flawed, and now we knew what needed to be improved. Until we did so, the exact same thing was at risk of happening again. There wasn’t just one thing that needed to be fixed. There were many things we learned and began to immediately improve. I became a much better troubleshooter and gained access to parts of the system where I could make a significant positive impact on recovery efforts moving forward.

For modern IT organizations, maintaining that line of reasoning and focus on improving the system as a whole is the difference between being a high-performing organization and a low-performing one. Those with a consistent effort toward continuous improvement along many vectors come out on top. Looking for ways to improve both our understanding of our systems and the way teams respond to inevitable failure makes us far more responsive and adaptable. Knowing about and remediating a problem faster moves us closer to a real understanding of the state and behavior of our systems.

What would have happened if this latent failure of the automated backup process in the system had lain dormant for longer than just a few months? What if this had gone on for a year? What if it was happening to more than just Open CRM instances on AWS? What if we had lost data that could have taken down an entire company?

To answer those questions better, we turn to the post-incident review. This type of analytic exercise is explored in depth in Chapter 8, where you’ll see how we define what an incident is, as well as when it is appropriate to perform an analysis.

As we’ll learn in the coming chapters, old-view approaches to retrospective analysis of incidents have many flaws that inherently prevent us from learning more about our systems and how we can continuously improve them.

By following a new approach to post-incident reviews, we can make our systems much more stable and highly available to the growing number of people who have come to rely on these services 24 hours a day, every day of the year.

Acknowledgments

I’d like to give an extra special “thank you” to the many folks involved in the creation of this report.

The guidance and flexibility of my editors Brian Anderson, Virginia Wilson, Susan Conant, Kristen Brown, and Rachel Head was greatly appreciated and invaluable. Thank you to Matthew Boeckman, Aaron Aldrich, and Davis Godbout for early reviews, as well as Mark Imbriaco, Courtney Kissler, Andi Mann, John Allspaw, and Dave Zwieback for their amazing and valuable feedback during the technical review process. Thanks to Erica Morrison and John Paris for your wonderful firsthand stories to share with our readers.

Thank you to J. Paul Reed, who was the first presenter I saw at Velocity Santa Clara in 2014. His presentation “A Look At Looking In the Mirror” was my first personal exposure to many of the concepts I’ve grown passionate about and have shared in this report.

Special thanks to my coworkers at VictorOps and Standing Cloud for the experiences and lessons learned while being part of teams tasked with maintaining high availability and reliability. To those before me who have explored and shared many of these concepts, such as Sidney Dekker, Dr. Richard Cook, Mark Burgess, Samuel Arbesman, Dave Snowden, and L. David Marquet…your work and knowledge helped shape this report in more ways than I can express. Thank you so much for opening our eyes to a new and better way of operating and improving IT services.

I’d also like to thank John Willis for encouraging me to continue spreading the message of learning from failure in the ways I’ve outlined in this report. Changing the hearts and minds of those set in their old way of thinking and working was a challenge I wasn’t sure I wanted to continue in late 2016. This report is a direct result of your pep talk in Nashville.

Last but not least, thank you to my family, friends, and especially my partner Stephanie for enduring the many late nights and weekends spent in isolation while I juggled a busy travel schedule and deadlines for this report. I’m so grateful for your patience and understanding. Thank you for everything.
