Human error is a symptom—never the cause—of trouble deeper within the system.
Dave Zwieback
Failure is inevitable in any complex system. While it is tempting to find a single person to blame, failures usually result from broader design issues in our systems. We are born and raised in a culture where blaming is a mechanism for overcoming fear and a way to discharge pain and discomfort. The good news is that we can design systems that reduce the risk of human error by examining the factors that contribute to failure, both systemic and human.
A Blameless Post-Mortem is essentially a retrospective held after a failure has been corrected. Its purpose is to find the causes of the failure, identify corrective actions that reduce the probability of similar failures in the future, and learn from what happened.
Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.
The following people should take part in a Blameless Post-Mortem meeting:
1. The people involved in the decisions and actions that may have contributed to the failure or problem.
2. The people who identified the problem.
3. The people who responded to the problem.
4. The people who diagnosed the problem.
5. The people who were affected by the problem.
6. Anyone who is interested in the process.
7. A facilitator
In a Blameless Post-Mortem meeting, participants do the following:
1. Construct a timeline of the failure. The timeline must incorporate different perspectives so that the whole picture is understood. People must be assured, repeatedly and through actions as well as words, that nothing they say will ever be used against them.
2. Empower all participants to speak and share their own account of how they contributed to the failure. While people are speaking, a few things must be observed:
a. Don’t use “would”, “could”, or “should”.
b. List the assumptions that were made.
c. Describe the actions taken, the expected outcomes, and the actual outcomes.
d. Describe how the timeline was understood as the actions were being taken.
e. The people who were at the keyboard should speak, not the people who only manage them.
3. Conduct a 5 Whys analysis to find the root cause of the failure.
4. Identify the top three action items.
5. One or more participants volunteer to take responsibility for carrying out the identified action items and for informing all participants once they are done, which closes the loop. The action items must be measurable.
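The meeting steps above can be captured in a simple record. The following is a minimal sketch in Python, not a prescribed format: the names `TimelineEntry`, `ActionItem`, `build_timeline`, and `top_action_items` are hypothetical, and it assumes candidate action items arrive already ordered by priority.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    """One participant's account of a single action (hypothetical structure)."""
    at: datetime
    reporter: str     # the person who was at the keyboard
    action: str       # what was done
    assumption: str   # what the person believed at the time
    expected: str     # the outcome they expected
    actual: str       # the outcome that actually occurred


@dataclass
class ActionItem:
    """A follow-up task; it needs an owner and a way to measure completion."""
    description: str
    owner: str        # the volunteer responsible for the item
    measure: str      # how the team will know the item is done


def build_timeline(*accounts):
    """Merge every participant's account into one chronological timeline."""
    merged = [entry for account in accounts for entry in account]
    return sorted(merged, key=lambda entry: entry.at)


def top_action_items(candidates, limit=3):
    """Keep the first `limit` items that have both an owner and a measure.

    Assumes `candidates` is already ordered by priority.
    """
    usable = [item for item in candidates if item.owner and item.measure]
    return usable[:limit]
```

Keeping the assumption, the expected outcome, and the actual outcome side by side in each entry makes it easier to describe what happened without resorting to “would”, “could”, or “should”.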
A Blameless Post-Mortem must be conducted as soon as possible after the failure is resolved; as a rule of thumb, within 48 hours.
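The 48-hour rule of thumb is easy to check mechanically. The sketch below is only an illustration; the names `REVIEW_WINDOW`, `review_deadline`, and `is_within_window` are hypothetical, and it assumes both timestamps use the same time zone.

```python
from datetime import datetime, timedelta

# Rule of thumb from the text: hold the meeting within 48 hours of resolution.
REVIEW_WINDOW = timedelta(hours=48)


def review_deadline(resolved_at):
    """Return the latest time the post-mortem should be held."""
    return resolved_at + REVIEW_WINDOW


def is_within_window(resolved_at, meeting_at):
    """True if the meeting falls between resolution and the deadline."""
    return resolved_at <= meeting_at <= review_deadline(resolved_at)
```

For example, a failure resolved on Monday at 09:00 gives a deadline of Wednesday at 09:00.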
The most important thing about a Blameless Post-Mortem is that no one is blamed for the failure, everyone learns an important lesson, and the team or organization gains resilience.
The facilitator of a Blameless Post-Mortem must be aware of a few biases that may affect the whole process:
1. Hindsight bias, also known as the knew-it-all-along effect or creeping determinism, is the inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it. It is a multifaceted phenomenon that can affect different stages of designs, processes, contexts, and situations. Hindsight bias may cause memory distortion, where the recollection and reconstruction of content can lead to false theoretical outcomes. It has been suggested that the effect can cause extreme methodological problems while trying to analyze, understand, and interpret results in experimental studies. A basic example of the hindsight bias is when, after viewing the outcome of a potentially unforeseeable event, a person believes he or she "knew it all along".
2. Confirmation bias, also called confirmatory bias or myside bias, is the tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses. It is a type of cognitive bias and a systematic error of inductive reasoning. People display this bias when they gather or remember information selectively, or when they interpret it in a biased way. The effect is stronger for emotionally charged issues and for deeply entrenched beliefs.
3. The outcome bias is an error made in evaluating the quality of a decision when the outcome of that decision is already known. Specifically, the outcome effect occurs when the same "behavior produce[s] more ethical condemnation when it happen[s] to produce bad rather than good outcome, even if the outcome is determined by chance."
Although outcome bias is similar to hindsight bias, the two phenomena are markedly different. Hindsight bias focuses on memory distortion to favor the actor, while outcome bias focuses exclusively on weighting the past outcome more heavily than other pieces of information when deciding whether a past decision was correct.
References:
1. The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, John Willis
2. Beyond Blame: Learning From Failure and Success by Dave Zwieback
3. The Field Guide to Understanding Human Error by Sidney Dekker
4. Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
5. The power of vulnerability by Brené Brown (http://www.ted.com/talks/brene_brown_on_vulnerability)
6. Listening to shame by Brené Brown (http://www.ted.com/talks/brene_brown_listening_to_shame)
7. Hindsight bias - https://en.wikipedia.org/wiki/Hindsight_bias
8. Confirmation bias - https://en.wikipedia.org/wiki/Confirmation_bias
9. Outcome bias - https://en.wikipedia.org/wiki/Outcome_bias
10. Open source Blameless Post-Mortem tool: Morgue - https://github.com/etsy/morgue