Key Takeaways
- Incidents are inevitable, but organizations can build resilience through investing in culture, process improvements, and learning.
- When improving incident response, focus on enhancing coordination, collaboration, and communication. Identify process gaps and opportunities to leverage automation to reduce cognitive load during incidents.
- Conduct blameless, narrative-based incident analyses focused on gathering multiple perspectives. Drive action items that can realistically be completed in short timeframes.
- Cross-incident analysis requires high-quality individual incident data and analytical skills. Use insights to provide leadership with data-driven arguments for initiatives and transformations like adding headcount, adopting new solutions, or changing processes.
- Avoid common anti-patterns like over-focusing on Mean Time to X metrics, churning out action items that go unaddressed, and failing to effectively communicate insights to decision makers.
Incidents prevent us from meeting our goals. Whatever your goal is – such as selling tickets to the Taylor Swift concert, getting people home for the holidays without delays, or shipping goods across the globe – incidents will happen. In my talk at QCon San Francisco 2023, I shared my insights.
Fortunately, incidents don’t happen in a vacuum. We can learn from past experiences to reduce the impact of future incidents.
Organizations that invest in a culture of resilience will have the capacity to recover quickly from incidents and will be able to turn those incidents into opportunities.
Resilience is the capacity to withstand difficulties or to recover quickly from them, whether those are outages or any other kind of incident.
The Incident Lifecycle
No matter how hard we try, we will never be at zero incidents. They happen because we keep releasing code, we keep making changes. And that’s a good thing.
This is simply part of how a system achieves its goals. It’s all part of the incident lifecycle.
Things hum along smoothly (the system is normal), and then something happens to overload your system and bring it down (the incident begins). Maybe you unveiled a hot new product or started selling tickets to the biggest concert ever, leading to a stampede of eager fans. Regardless of the trigger, you’re now grappling with an all-out incident.
The immediate focus is restoring normal operations (incident resolution). Afterward, there are plenty of opportunities to evaluate what happened (post-incident activities), including formal and informal post-mortems. Maybe your company has a formal process, with documentation and identified action items. Or maybe the extent of your post-mortem is simply a few coworkers recapping the situation over lunch or drinks. Key learnings then get applied (learnings are applied) to improve the system: enhancing on-call staffing, changing how requests queue up, or even spurring outside investigations if problems prove systemic.
It’s important to recognize that “the system” isn’t just lines of code. It’s the sociotechnical fabric combining technologies and teams. Each incident provides lessons that change how people handle future incidents. Our brains store this growing experience. Choices driven by post-incident wisdom mean the system evolves iteratively. Once these learnings are applied, we end up back in the “system is normal” state.
How we handle incidents can lead us to more resilient infrastructure and wiser policies over time. Incidents provide us with an invitation to learn, since how we respond shapes the system – and our collective readiness for whatever happens next.
Resilience is Possible
Why do we focus on this at all? Simply put, incidents are expensive for companies. There are multiple ways that those expenses can show up:
- Site outages can cut off access and, with it, any income stream.
- Reputational damage can mean less future use.
- Staff workloads are interrupted, not only during the incident but afterward as well.
- Goals and planning are affected, as engineers can’t focus on the work that moves them forward.
There are three points within the incident lifecycle where we can focus time and energy to improve the learning cycle and gain some bandwidth to improve resilience in the system. It’s not easy, because you’ll generally have to make small adjustments and changes along the way. CTOs won’t generally approve $100,000 for cross-incident analysis (that won’t be a marketable improvement to stakeholders) without evidence that it’s helpful.
A Focus on Incident Response
Let’s start with the incident response itself. There are three main areas where you can seek to improve this part of the cycle:
- Coordination: Document the current workflow that you follow when an incident happens, even if it’s simply “call that person who’s been here for 10 years.” Identify where there may be gaps and talk about how to make those gaps a little smaller.
- Collaboration: How do you get people in a room to resolve the situation? How do you know who to call? I worked in one organization where I’d have to log into my laptop, get on the VPN and open up the wiki to find phone numbers to punch into my phone. That’s a lot of work to call one person. How can you make it easier to get people together?
- Communication: How are communications with stakeholders happening? How do you tell your customers what is happening? What are their expectations? Write some loose guidelines to help manage expectations so everyone understands what responses should look like. And remember that your stakeholders and your customers may need different information.
None of these improvements need to be perfect. We’re just looking for small improvements, ways to reduce cognitive load. Perhaps you can add some automation to help with that.
Anti-Pattern: MTTX
MTTX is a term I use to indicate mean time to discovery, mean time to recovery, to resolution, whatever single metric you may be using to track how quickly you complete something. Let me go on the record and state that those numbers don’t mean anything. It’s not that we shouldn’t care at all about how long our incidents last. However, a single number should not be the goal.
One time I had an issue where our data center caught fire, and the fire marshal wasn’t going to let us go back on. An incident like that is simply going to take longer. Just like we’re never going to be at zero incidents, we are not always in control of how long it takes to recover from one. What we can do is make the experience better for our users and our engineers, because that’s actually what’s going to help us get ahead. There are things that we can do to make incident response easier, so that our engineers are better equipped to resolve incidents.
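To see why a single "mean time to X" figure can mislead, here is a minimal Python sketch. The durations are made-up numbers for illustration, not data from any real organization. The last value stands in for an incident, like the data center fire, whose length was outside the team's control.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes. The last entry stands in for
# an outage whose recovery time the team could not control.
durations_min = [20, 25, 30, 35, 40, 1290]


def summarize(durations):
    """Return the mean, median, and longest duration for a set of incidents."""
    return {
        "mean": mean(durations),
        "median": median(durations),
        "max": max(durations),
    }


summary = summarize(durations_min)
print(summary)
```

With these numbers the mean comes out at 240 minutes while the median is 32.5. One unusual incident dominates the average, which is one reason the number alone says little about how response is actually going.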
A Focus on Incident Analysis
After an incident occurs, it’s incredibly important to understand and learn from it through an incident analysis. I have found that a narrative-based approach helps unearth information. This way, you can highlight what happened, and how folks experienced the incident from different points of view.
First, it’s important to gather all the data about the incident. Look at who was involved, where they were located, and how communication happened: was it on Slack, Zoom, or phone? Compile all of it into an organized repository with a timeline laying out how the incident played out. Make note of any open questions or gaps in the data. Then decide whether you need to set up a meeting, or if you can discuss it asynchronously, perhaps with a shared document.
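As a rough illustration of the data-gathering step, the sketch below merges records from different channels into one ordered timeline and pulls out the items flagged as open questions. The `Event` fields, the helper names, and the sample records are all hypothetical. Your own tooling and data will look different.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime                 # when it happened
    source: str                  # where the record came from, e.g. "slack" or "phone"
    who: str                     # person or team involved
    note: str                    # what was said or done
    open_question: bool = False  # True if this still needs follow-up


def build_timeline(*sources):
    """Merge events from several sources into one list ordered by time."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda event: event.at)


def open_questions(timeline):
    """Return the events flagged as gaps to raise in the review."""
    return [event for event in timeline if event.open_question]


# Made-up sample records, for illustration only.
slack = [
    Event(datetime(2024, 1, 5, 14, 2), "slack", "on-call engineer",
          "Paged for checkout errors"),
    Event(datetime(2024, 1, 5, 14, 20), "slack", "on-call engineer",
          "Could not open the dashboard", open_question=True),
]
phone = [
    Event(datetime(2024, 1, 5, 14, 10), "phone", "customer support",
          "Spike in customer complaints"),
]

timeline = build_timeline(slack, phone)
for event in timeline:
    print(event.at.isoformat(), event.source, event.who, event.note)
```

Keeping the source and the person on each record makes it easier to see whose perspective is still missing before the review discussion.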
You need perspectives from across the organization. The discussion shouldn’t include only the incident manager and the person who pushed the bad code. I find that folks in marketing, product management, and especially customer support have great insights into the impact of an incident.
When you meet, make sure it’s an open conversation – the person facilitating should be talking less than anyone else in the room. This way, you will capture how this incident affected different groups. You may learn, for example, that the on-call engineer lacked dashboard access or customer support got slammed with complaints.
Finally, distill down the key insights and action items into a format suited for your audience. Maybe a short executive summary is needed for leaders, and a list of identified action items for different teams. Share out the findings so they actually get incorporated into improving your resilience.
The goal is to effectively communicate contributing factors and the incident’s impact, in order to empower change. Some additional resources for improving your incident analysis process include the Jeli Howie Guide and Etsy’s Debriefing Facilitation Guide for Blameless Postmortems.
Anti-Pattern: Action Items Factory
It’s important to avoid the common anti-pattern of simply producing action items from post-incident reviews without following through. Creating alerts for specific issues or generating fixes that get deprioritized into a forgotten Google doc or backlog only breeds mistrust in the process. If the proposed fixes and improvements never materialize, engineers and stakeholders will consider review meetings a waste of time.
To resist this anti-pattern, incident reviews should examine organizational resilience more broadly, rather than focusing solely on action items. Quite often, we see either tiny action items or overly ambitious ones, when what we need is a middle ground: the “Goldilocks” actions that are just right in terms of scope and priority. By taking a step back from narrow fixes, organizations can uncover more opportunities for systemic resilience enhancements and determine where to best allocate resources. But any identified improvements must get rolled out if we want to maintain trust and engagement in the process.
The perfect action item is something that can be prioritized among other currently planned work and completed within the next month or two. It also should move the needle in relation to the incident.
A Focus on Cross-Incident Insights
Analysis across multiple post-mortems is crucial for identifying systemic issues and making meaningful improvements. By aggregating insights from individual high-quality incident reviews, organizations can uncover patterns and deficiencies that may require company-wide initiatives or transformations, such as adjustments in staffing, vendor changes, or new solutions.
Driving this analysis should be a collaborative effort, incorporating perspectives from engineering, product, customer support, marketing, and leadership. No one person should own the process. Presenting the data in an easy-to-digest manner is also critical for contextualizing the issues and convincing leadership to support major changes.
Many organizations currently struggle with effective cross-incident analysis, either because they lack high-quality data from incident reviews or because that data is scattered across multiple mediums, which hinders analysis. Additionally, most engineers are not trained data analysts equipped to compile insights in this manner.
When done well, these insights can paint a clearer picture of pain points across incidents. Rather than call out specific teams, the goal is to provide helpful support by reallocating resources or updating processes. This level of resilience enables engineers to better avoid issues down the line and builds an environment for innovation.
One example of gaining insight from cross-incident analysis is understanding where your incidents are taking place. Maybe one particular area of your codebase had the majority of incidents. How can you better support the teams that are managing that area? Or does the code need some re-architecting?
I’ve done this with a number of teams. They’ve been able to make organization-wide decisions based on incident trends. We’ve seen people use these insights to decide on things like feature flag solutions, to switch vendors, or even to make organizational changes. All of this creates an environment where engineers are able to thrive.
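The question of where your incidents are taking place can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the incident records, the `area` field, and the `share_by_area` helper are made up, and real incident data would need cleaning and richer context before you draw conclusions from it.

```python
from collections import Counter

# Made-up incident records; "area" is a hypothetical field naming the part
# of the codebase or system each incident was associated with.
incidents = [
    {"id": "INC-101", "area": "ci-cd"},
    {"id": "INC-102", "area": "checkout"},
    {"id": "INC-103", "area": "ci-cd"},
    {"id": "INC-104", "area": "search"},
    {"id": "INC-105", "area": "ci-cd"},
]


def share_by_area(records):
    """Return (area, count, share of all incidents), most frequent first."""
    counts = Counter(record["area"] for record in records)
    total = sum(counts.values())
    return [(area, count, count / total) for area, count in counts.most_common()]


report = share_by_area(incidents)
for area, count, share in report:
    print(f"{area}: {count} incidents ({share:.0%})")
```

A count like this is a starting point for a conversation about how to support the teams in that area. It is not a verdict on them.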
Anti-Pattern: MTTX (again)
There’s a risk with cross-incident analysis of falling into the MTTX anti-pattern as well. Some executives may request some set of metrics that really won’t provide practical insight. No engineer or analyst is paid enough to argue with the CTO, so instead I encourage you to provide context here. I call this the “veggies in the pasta sauce” method. The metric itself may not be helpful, but the context that you slip in with it makes the information that much richer.
Anti-Pattern: Not Communicating the Insights
I want to share one last anti-pattern, which is related to cross-incident insight, but can apply to any insights gathered from evaluating incidents. Our work should not just be completed to be filed away.
Any documentation made should be created so it can be read and shared across the organization, even long after we’ve taken our initial actions. I’ve often sat in review meetings where engineers say, “We all know what to do, we just don’t do it.” My question is – do we all know? Has this been clearly communicated?
Often, people don’t engage with our learnings because the format feels irrelevant to them. Different audiences need to grasp different insights. While your CTO may grasp technical details, you’ll lose others in leadership. Tailor the language and detail level to what your audience needs to understand. Focus on your goal. Don’t dwell on unimportant specifics. If making a case to rearchitect a system, explain the rationale based on data and leave it at that.
Consider your format as well. Do you use insights, metrics, storytelling, and technical detail? Start with your top insight. For example, “X proportion of incidents relate to our outdated CI/CD pipeline. This impacts all products and makes onboarding difficult. Feedback from stakeholders and experts recommends focusing next quarter on rearchitecting this.” We suggest this not on a whim, but because data backs it up. Show your work. The person hearing this information should be able to follow the trail from individual incident analysis to cross-incident insights to action.
Summary
We will never be at zero incidents. Technology evolves and new challenges come up, but if we focus on incident response, incident analysis, and cross-incident insights while looking out for those anti-patterns, we can lower the cost of incidents. We can be better prepared to handle them, which is going to lead to a better experience for our users. It’s going to lead to a culture where engineers are engaged, and they have the bandwidth not only to complete their work, but also to creatively solve problems to help us achieve the goals of our organizations.