A few words about Blameless culture

The concept of blameless culture has been around for a long time in other industries, and while the history isn’t clear, you could argue that it became an “official” part of the tech industry with the publication of the definitive book Site Reliability Engineering in 2016. 

My summary of blameless culture is: when there is an outage, incident, or escaped bug in your service, assume the individuals involved had the best of intentions, and that either they did not have the correct information to make a better decision or the tools allowed them to make a mistake.

During my tenure at Arctic Wolf I embraced and evangelized blameless culture, and brought it into every part of our organization that I could, including of course the post-mortem process. That culture took on a life of its own, as cultures do, and in some cases it grew in directions I hadn’t expected. Before I expand on my definition of blameless, I’ll share a personal story that’s somewhat embarrassing but frames my journey.

Once upon a time, when I was young and cocky, I was a Unix system administrator at a large manufacturing company. I responded to a call one night where the power for a data center was cut during maintenance of a fire suppression system in the building, causing immediate and expensive downtime to the manufacturing plant. At the beginning of the subsequent root cause meeting (the modern post-mortem language wasn’t common in the industry yet) I blurted out something to the effect that the meeting was a waste of time because we knew the fire department had tripped the power. Awkward silence ensued, and I discovered the gentleman sitting beside me was the fire chief. He had every right to express his anger at my blameful comment, but instead he quietly explained the process of determining root cause without assigning blame. As it turned out I was wildly wrong and the cause was, as is so often the case, a cascading failure with a side dish of unexpected behavior, and no one could have predicted the end result.

That episode stuck with me for a few reasons, in large part because it was an embarrassing moment, but it also taught me a few key lessons that took some time to percolate. That moment was a clear example of how to be blameful, not blameless. The more important reason it sticks in my memory is the fantastic leadership example the fire chief set for me: a senior leader took the opportunity to mentor a newer employee through a potentially bad situation, and I’ve tried to emulate that wherever I can. He very likely has no idea how much impact he had on my career, but I’m very thankful for his patience with me that day.

My favorite public example of blameless behavior is when AWS S3 went down in the us-east-1 region on February 28, 2017, taking down a good portion of the Internet. AWS wrote in the public post-mortem that “an authorized S3 team member using an established playbook” … performed some maintenance, and the team member inadvertently introduced a typo which resulted in the outage. AWS’s response could have been “an employee made a mistake and we fired them,” but instead the response, in my brief summary, was: the tool did not have enough safeguards and the service did not recover fast enough. AWS addressed the situation by building more safeguards into the tooling and by improving the service’s recovery time.

One unexpected direction the blameless culture took, which in hindsight I’m happy about, was to not name individuals beyond the boundaries of the team. This can create the perception among senior leadership, who may not fully understand blameless culture, that their teams are not taking responsibility for their work. Talented developers in small teams are typically quick to point out their own mistakes, but I insist that post-mortems not stop when someone points out their own mistake; we keep investigating how to improve tools and information. The individual’s name is almost immaterial to the post-mortem; the steps needed to improve the systems are the key things to communicate to company leaders.

Finding the balance between taking responsibility for an issue and not naming names is difficult, but increasingly important as an organization grows. While I fully expect my team to own up to any mistakes, typos, assumptions, etc. during a post-mortem or in private, their names should be kept out of any docs published to the broader organization. Removing names provides psychological safety for team members, and prevents the temptation for leadership or others to blame individuals and take action. When presented with names I’ve seen leaders take immediate vindictive action, like termination, but also delayed action affecting compensation or promotion decisions later. In my ideal world the team as a whole, and its respective leadership chain, take responsibility for the events.

In my definition of blameless I “assume … best of intentions”. But. I’ve always made it clear to my teams that there is an unwritten exception clause: if an incident, outage, or bug was caused either by deliberate action or by a mistake that was then covered up, there would be immediate consequences, most likely resulting in the individuals finding new employment. Deliberate action causing damage is easy to understand and will likely involve law enforcement, but covering up mistakes deserves a bit more attention. I have two examples, one negative and one positive:

I was part of the response team in a situation where an individual made a mistake and then attempted to cover their tracks. That resulted in a spectacular 48-hour outage and, shortly thereafter, the termination of their employment. The org’s leadership was clear that the individual was not fired for making the mistake, but for the attempted coverup.

Back in my sysadmin days a colleague accidentally deleted a production database. They immediately took their hands off their keyboard and said “hey team, I just messed up and need help”. We rallied together and came up with a solution without causing an expensive outage. My colleague could have covered it up, and I’m not sure we would ever have found out why the database had disappeared, but because we had a blameless culture and a great team, the individual had the psychological safety net to immediately ask for help.

Early in my career a leader told me, well before chaos engineering was a thing, “If you’re not breaking things once in a while you’re not working hard enough.” Anyone who runs production services lives this every day and knows there is a risk of making a mistake that will cause an outage. The best possible action when you make a mistake is to immediately escalate to your team and leadership and ask for help remediating the issue. At Arctic Wolf we had a cultural tenet, “Bad News Fast”, which we applied to projects and incidents alike, and it contributed heavily to the company’s success. Team members were rewarded and praised for raising bad news fast, whether it was an issue in production, an upcoming capacity planning threshold that would cause months of rework, or an issue in a project that would significantly delay the delivery.

Blameless culture cannot be reserved only for post-mortems; it needs to be lived and promoted every day. Rob Zuber of CircleCI summarized it really well in The value of blameless culture: “Incidents are the microscope under which we tend to examine blameless culture” and “… the culture that shows up in your incident response is going to be a direct reflection of the culture that you build every day, with every action, under the most mundane circumstances.”

Building a blameless culture is a complex topic, but one well worth investing a lot of time in; that investment will result in a healthy and high-performing organization. Lots of great books and articles have been written on how to implement it, with a few key ones here:

Want to learn more? Ping me via LinkedIn; I’d be happy to chat.