Saturday, November 5, 2016

The Agile way of postmortem (or, I should say, Learning Workshop)



Almost every summer, California has many forest fires. Unfortunately, these fires are deadly and sometimes result in loss of human life. Firefighters give their all to contain and extinguish them. After a fire is contained and extinguished, a postmortem of the incident is performed, not to blame individual firefighters but to learn from it, so the next fire can be tackled in a better way. Each incident provides an opportunity to learn and improve. The U.S. Forest Service does not blame its firefighters but promotes accountability and transparency by encouraging them to speak the truth without fear of reprisal.

Someone might argue: why not prevent the fires altogether? Do you think that is possible? Forest growth and fire are part of a natural cycle. You may reduce the number of incidents, and you can handle them effectively and efficiently, but you cannot eliminate them. Failure is as normal a part of a system as its working.
Let’s jump to the software support and maintenance arena. It is the Black Friday sale, but during peak hours the Point of Sale system is slow and credit card processing is slower. We are losing money every moment in lost sales. It is a PRIORITY ONE (P1) and SEVERITY ONE (S1) incident. Engineers have been called into the war room to resolve the issue NOW. The incident has high visibility; even the CTO and other executives are on the conference call. Ron, the system admin, issues a command to check server status. Within a minute, the server crashes. Thanks to load balancing, traffic switches to the backup server, but it is still slow. Ron logs in to this server too and issues the same command to check its status. Boom! Within a few minutes, this server goes offline as well. No automatic credit card processing. Stores are instructed to fall back to the old manual way of processing cards. Millions are lost. Ron and the team keep working on the servers, and after frantic calls with the server vendor they zero in on a recent patch as the probable cause. The patch is rolled back and the system recovers. But a day is lost, and so are the millions. The patch had introduced a bug that is triggered only when certain conditions are met, such as high traffic combined with execution of the status command. Gosh!

The next day, a root cause analysis (RCA) is performed and Ron is asked to leave because of "operator error". He is blamed for being careless: why did he not notice that his status command was the reason the first server crashed?

Do you think our retailer’s IT team lost a great opportunity to learn from the incident? Ron’s departure is the result of cognitive biases and hindsight. The environmental conditions (pressure from executives to fix it now, an innocuous status command that sys admins routinely use hundreds of times a day, …) were not properly analyzed, and the timeline was not reconstructed as it looked at the time, without the benefit of hindsight. In complex systems, isn’t "operator error" a trivialization of both the operator’s intellectual capacity and the system’s complexity? Is this the Agile way of doing a postmortem? NO, it is not.

First, I would not like to call it a postmortem. We do post-mortems of dead bodies, not of live systems. I like to call it a Learning Workshop. The goal of a Learning Workshop is to learn from failures and successes in order to act effectively and efficiently in the future. A Learning Workshop is an exercise to prepare for future successes and failures, not to blame someone. It accepts that failure is as much a part of the system as success. It does not look for a sacrificial lamb; it looks for accountability and transparency, with the assurance that whatever is shared will not be used against you. Just remember how a Retrospective is conducted in Scrum: whatever is discussed there stays there, because it is a private meeting. In the case of a Learning Workshop, however, the findings should be shared widely so that learning can be maximized.

While conducting a Learning Workshop, follow some simple guidelines:

· Beware of cognitive biases
· Accept that both success and failure are normal
· Do not judge with the benefit of hindsight
· The goal is learning, not finding a sacrificial lamb
  
Now the choice is yours: do you want to do a postmortem or a Learning Workshop?
