Almost every summer, California has many forest fires. Unfortunately,
these fires are deadly and sometimes result in loss of human life.
Firefighters give their all to contain and extinguish these fires.
After a fire is contained and extinguished, a postmortem of the incident is
performed, not to blame individual firefighters but to learn from it, so the
next fire can be tackled in a better way. Each incident provides an opportunity
to learn and improve. The U.S. Forest Service does not blame its firefighters
but promotes accountability and transparency by encouraging them to speak the
truth without fear of reprisal.
Someone may ask: why not stop the fires themselves? Do you think it
is possible? Forest growth and fire are part of a natural cycle. You may reduce
the number of incidents, and you can handle them effectively and efficiently,
but you can’t eliminate them. Just as normal working is part of the system, so
is failure.
Let’s jump to the software support and maintenance arena. It is the Black
Friday sale, but during peak hours the Point of Sale system is slow and credit
card processing is slower. We are losing money every moment as lost sales. It
is a PRIORITY ONE (P1) and SEVERITY ONE (S1) incident. Engineers have been
called into the war room to resolve the issue NOW. The incident has high
visibility; even the CTO and other executives are on the conference call. Ron,
a system admin, issues a command to check server status. Within a minute, the
server crashes. Thanks to load balancing, traffic switches to the backup server,
but it is still slow. Ron logs in to this server too and issues a command to
check its status. Boom: within a few minutes, this server goes offline. No
automatic credit card processing. Stores are instructed to use the old manual
way of card processing. Millions are lost. Ron and the team are still working
on the servers, and after frenetic calls with the server vendor, they zero in
on a recent patch as the probable cause. The patch is rolled back and the system
recovers. But a day is lost, and so are the millions. The patch had introduced
a bug that takes effect when certain conditions are met, such as high traffic
combined with execution of the status command. Gosh!
The next day, a root cause analysis (RCA) is performed and Ron is asked to
leave because of operator error. He is blamed for being careless: why did he
not notice that his status command was the reason the first server crashed?
Do you think our retailer’s IT team lost a great opportunity to
learn from the incident? Ron’s departure is the result of cognitive biases and
hindsight. Environmental conditions (pressure from executives to fix it now,
the innocuous status command which sys admins use hundreds of times a day, …)
and timelines were not properly analyzed and reconstructed without the
benefit of hindsight. In complex systems, isn’t blaming operator error a
trivialization of the operator’s intellectual capacity and the system’s
complexity? Is this the Agile way of doing a postmortem? NO, it is not.
First, I do not like to call it a postmortem. We do post-mortems
of dead bodies, not of live systems. I like to call it a Learning Workshop. The
goal of a Learning Workshop is to learn from failures and successes in order to
act effectively and efficiently in the future. A Learning Workshop is an
exercise to prepare for future successes and failures, not to blame someone. A
Learning Workshop accepts that failure is as much a part of the system as
success. A Learning Workshop does not look for a sacrificial lamb; it looks for
accountability and transparency, with the assurance that whatever is shared
here will not be used against you. Just remember how a Retrospective is
conducted in Scrum: whatever is discussed here remains here; it is a private
meeting. In the case of a Learning Workshop, however, findings should be shared
widely, so that learning can be maximized.
While conducting a Learning Workshop, follow some simple guidelines:
· Beware of cognitive biases
· Accept that success and failure are both normal
· Do not rely on the benefit of hindsight
· The goal is learning, not looking for a sacrificial lamb
Now the choice is yours: do you want to do a postmortem or a Learning Workshop?