Three Mile Island

It’s instructive to read accounts of colossal engineering failures and to consider how their lessons apply to software and software testing.

The riveting, somewhat clinical accounts of the incident and its ethical implications reveal a lot about how people react to crises and how the tools (hardware & software) they are given can help or hinder the resolution.

Of particular interest to those of us doing devops work or building support tools are some of these tidbits from the report of the President’s Commission:

Over 100 alarms went off in the early stages of the accident with no way of suppressing the unimportant ones and identifying the important ones…

Several instruments went off-scale during the course of the accident … these instruments were not designed to follow the course of an accident.

[My favorite] The computer printer registering alarms was running more than 2½ hours behind the events and at one point jammed, thereby losing valuable information.

(Pages 29-30 in print)

Are the tools that monitor your system capable of suppressing unimportant alarms in a crisis? Are your metrics scaled to record values far outside normal operating ranges? Is your monitoring/logging system capable of handling dramatically more messages than usual – does it get behind, or does it drop messages? There are obvious drawbacks to either approach.
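To make that last trade-off concrete, here is a minimal sketch in Python (not from the report or the original post; `emit_alarm`, `slow_alarm_printer`, and the `drop_when_full` flag are hypothetical names) showing a bounded alarm buffer that must either drop messages or let the producer fall behind when the consumer can’t keep up:

    import queue
    import threading
    import time

    alarm_queue = queue.Queue(maxsize=100)  # bounded buffer for alarm messages
    dropped = 0

    def emit_alarm(message, drop_when_full=True):
        """Enqueue an alarm: either drop it when the buffer is full (lossy)
        or block until space frees up (lossless, but the emitter gets behind)."""
        global dropped
        try:
            alarm_queue.put(message, block=not drop_when_full)
        except queue.Full:
            dropped += 1  # lost information, like the jammed TMI printer

    def slow_alarm_printer():
        """A consumer that is slower than the burst of incoming alarms."""
        while True:
            message = alarm_queue.get()
            time.sleep(0.01)  # simulate a slow printer / downstream system
            print(message)
            alarm_queue.task_done()

    threading.Thread(target=slow_alarm_printer, daemon=True).start()

    # Simulate a crisis: a burst of alarms far larger than the buffer.
    for i in range(1000):
        emit_alarm(f"ALARM {i}", drop_when_full=True)

    print(f"dropped {dropped} alarms")  # what the lossy policy costs you

With `drop_when_full=True` the burst never backs up the emitter, but most of the alarms vanish; with `drop_when_full=False` nothing is lost, but the emitter stalls and falls further behind the events – the 2½-hour printer lag in miniature.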

Are the controls you use to investigate disasters as well designed as your product? TMI was full of design flaws in its controls, from labels that obscured critical indicator lights to necessary switches located on the wrong side of the equipment.

(I have a vague memory of a similar incident with a teletype machine failing to print alerts in a scifi thriller – maybe Michael Crichton? Please write a comment if you remember the source and whether the author was inspired by the TMI incident.)

PDF of the President’s Commission Report on Three Mile Island
