Monday, March 26, 2018

Evolution of DevOps


Scenario 1: I have a check to deposit. I go to the bank and ask a teller to deposit the check in my account. He takes my check and makes a few entries in his register. In the evening he transmits the details of the check to a centralized location, which verifies the signature and confirms that the person who signed the check has sufficient funds in his account. If everything is fine, the amount is transferred to my account. This whole process may take a couple of days.

Scenario 2: I have a check to deposit. I go to the bank and ask a teller to deposit the check in my account. He takes my check and verifies its details against an online system to which he has access. If everything is fine (funds are available and the signature matches), the funds are transferred to my account in a couple of minutes.

Scenario 3: I have a check to deposit. I open the bank's app on my smartphone, take a picture of the check, and make a few entries. In a few minutes the funds are in my account.

The three scenarios above show how banking has evolved from the consumer's point of view. In the first scenario, the system was manual: the consumer had to make a trip to the bank, and the fund transfer took a couple of days. In the second scenario, the process is the same and the consumer still has to make a trip to the bank, but the teller has access to automation that strips away the delay. In the last scenario, the consumer has access to a self-service portal, which removes the intermediary (the bank teller) and makes the consumer responsible for the deposit.

Now let's turn to IT. Earlier, we worked in the equivalent of the bank's brick-and-mortar era, where developers had to visit Operations and request a deployment. With automation, the Operations job becomes easier (assuming the complexity of the infrastructure remains the same) and faster, but developers still need the intermediary (the operations team). Now we are entering a new era where the distinction between Dev and Ops is blurring because of self-service and increasing automation, not only at the application level but at the infrastructure level as well.

Monday, February 19, 2018

Blameless Post-Mortem



Human error is a symptom—never the cause—of trouble deeper within the system.
Dave Zwieback




Failure is inevitable in any complex system. While it’s tempting to find a single person to blame, these failures are usually the result of broader design issues in our systems. We are born and raised in a culture where blaming is one of the mechanisms for overcoming fear and a way to discharge pain and discomfort. The good news is that we as humans have designed systems that reduce the risk of human error by looking into the factors contributing to failure, both systemic and human.

A Blameless Post-Mortem is essentially a post-correction retrospective for a failure. Its purpose is to find the cause of the failure, to identify corrective actions that reduce the probability of future failures, and to learn.

Like a retrospective, a successful Blameless Post-Mortem follows a prime directive:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

The following people must participate in a Blameless Post-Mortem:
1.    The people involved in the decisions and actions that may have contributed to the failure/problem.
2.    The people who identified the problem.
3.    The people who responded to the problem.
4.    The people who diagnosed the problem.
5.    The people who were affected by the problem.
6.    Anyone who is interested in the process.
7.    A facilitator.

In a Blameless Post-Mortem meeting, participants do the following:

1.    Construct the timeline of the failure. This timeline must incorporate different perspectives so that the whole picture is understood. People must be assured (repeatedly, by words as well as actions) that nothing will be used against them EVER.
2.    Empower all participants to speak and share their own account, detailing their contribution to the failure. While people are speaking, a few things must be observed:
a.    Don’t use “would”, “could”, and “should”.
b.    List the assumptions that were made.
c.    State the actions taken, the expected outcomes, and the actual outcomes.
d.    Describe how the timeline was understood as the actions were taken.
e.    The people who were on the keyboard should speak, not the people who only manage them.
3.    Conduct the 5 Whys to find the root cause of the failure.
4.    Identify the top three action items.
5.    One or more participants voluntarily take responsibility for carrying out the identified action items and for informing all participants when they are done, to close the loop. The action items must be measurable.
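
The meeting steps above can be captured in a small record-keeping sketch. This is only an illustration, not part of any tool mentioned in this post: the names TimelineEntry, ActionItem, merge_timeline, flag_counterfactuals, and top_action_items are all hypothetical. It shows one way to merge several people's accounts into a single timeline, to spot the words "would", "could", and "should" in a statement, and to keep at most three owned, measurable action items.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One action from one participant's account (hypothetical structure)."""
    at: datetime
    who: str
    action: str
    expected: str
    actual: str
    assumptions: list = field(default_factory=list)


@dataclass
class ActionItem:
    """A follow-up item; 'measure' says how we will know it is done."""
    description: str
    owner: str
    measure: str


def merge_timeline(*accounts):
    """Combine every participant's account into one list ordered by time."""
    merged = [entry for account in accounts for entry in account]
    return sorted(merged, key=lambda entry: entry.at)


def flag_counterfactuals(statement):
    """Return each 'would', 'could', or 'should' found, in order of appearance."""
    found = re.findall(r"\b(would|could|should)\b", statement, flags=re.IGNORECASE)
    return [word.lower() for word in found]


def top_action_items(candidates, limit=3):
    """Keep items that have an owner and a measure, capped at 'limit'."""
    usable = [c for c in candidates if c.owner.strip() and c.measure.strip()]
    return usable[:limit]
```

A facilitator could call merge_timeline with one list of entries per participant, run flag_counterfactuals on draft notes as a gentle reminder rather than a rule, and pass the brainstormed items to top_action_items in priority order.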

A Blameless Post-Mortem must be conducted as soon as possible after the failure is resolved; as a rule of thumb, within 48 hours.



The most important thing in a Blameless Post-Mortem: no one is blamed for the failure, everyone learns an important lesson, and the team or organization gains resilience.

The facilitator of a Blameless Post-Mortem must be aware of a few biases that may affect the whole process:

1. Hindsight bias, also known as the knew-it-all-along effect or creeping determinism, is the inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it. It is a multifaceted phenomenon that can affect different stages of designs, processes, contexts, and situations. Hindsight bias may cause memory distortion, where the recollection and reconstruction of content can lead to false theoretical outcomes. It has been suggested that the effect can cause extreme methodological problems while trying to analyze, understand, and interpret results in experimental studies. A basic example of the hindsight bias is when, after viewing the outcome of a potentially unforeseeable event, a person believes he or she "knew it all along".


2. Confirmation bias, also called confirmatory bias or myside bias, is the tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses. It is a type of cognitive bias and a systematic error of inductive reasoning. People display this bias when they gather or remember information selectively, or when they interpret it in a biased way. The effect is stronger for emotionally charged issues and for deeply entrenched beliefs.


3. The outcome bias is an error made in evaluating the quality of a decision when the outcome of that decision is already known. Specifically, the outcome effect occurs when the same "behavior produce[s] more ethical condemnation when it happen[s] to produce bad rather than good outcome, even if the outcome is determined by chance."

While similar to the hindsight bias, the two phenomena are markedly different. The hindsight bias focuses on memory distortion to favor the actor, while the outcome bias focuses exclusively on weighting the past outcome heavier than other pieces of information in deciding if a past decision was correct.
 


References:
1. The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, John Willis
2. Beyond Blame: Learning From Failure and Success by Dave Zwieback
3. The Field Guide to Understanding Human Error by Sidney Dekker
4. Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
5. The power of vulnerability by Brené Brown (http://www.ted.com/talks/brene_brown_on_vulnerability)
6. Listening to shame by Brené Brown (http://www.ted.com/talks/brene_brown_listening_to_shame)
7. Hindsight bias - https://en.wikipedia.org/wiki/Hindsight_bias
8. Confirmation bias - https://en.wikipedia.org/wiki/Confirmation_bias
9. Outcome bias - https://en.wikipedia.org/wiki/Outcome_bias
10. Open source Blameless Post-Mortem tool: Morgue - https://github.com/etsy/morgue