February 10, 2012

RIP Roger Boisjoly: Some Underappreciated Lessons of the Challenger Disaster


Roger Boisjoly has died.

The name may not ring a bell, but Boisjoly's place in history certainly will: He was the engineer who tried in vain to persuade NASA that it was unsafe to launch the space shuttle Challenger on January 28, 1986.

The Challenger explosion remains today one of our most evocative images of technology gone wrong. This is due in part to the personal nature of the tragedy – the schoolteacher onboard, the family members watching – and in part to the subsequent revelations that NASA proceeded with the launch despite Boisjoly's warnings.

My intention here is not to rehash the chain of events that led to the Challenger's demise, but to show how some of those events demonstrate patterns of error that are commonplace – indeed, almost inevitable – in the operation of complex technological systems. 

These thoughts have been inspired mainly by the analysis of the Challenger explosion provided by Harry Collins and Trevor Pinch in their book, The Golem at Large: What You Should Know About Technology. Other key sources include Charles Perrow's Normal Accidents: Living With High-Risk Technologies and Jay Hamburg's reporting in The Orlando Sentinel.

I'll group the patterns to be discussed – let's call them Underappreciated Contributing Dynamics – in two categories, the first involving the question of certainty, the second involving the consequences of human interaction with machines.

Underappreciated Contributing Dynamic #1: There is no certainty.

The Challenger explosion is thought to have occurred because the O-rings in the joints between sections of the booster rockets that powered the shuttle's ascent into space failed to seal properly. The failure of the seals allowed a tiny gap to form between the sections. Flaming gas leaked through the gap and exploded.

The conventional wisdom is that NASA bureaucrats, anxious to press forward with the launch largely for public relations reasons, ignored the warnings of Boisjoly and others who recognized the danger and tried to stop the launch. There's truth to that narrative – comforting truth, because it reassures us that if we only follow the proper procedures, such accidents can be prevented. In practice, it's not that simple.

Engineers at NASA and Morton Thiokol, the contractor responsible for building the booster rockets, had known for years that there was a problem with the seals. The question was not only what was causing the problem and how to fix it, but also whether the problem was significant enough to require fixing.

According to Collins and Pinch, the O-rings were just one of many shuttle components that didn't perform perfectly and about which engineers had doubts. To this day, they add, we can't be sure the O-rings were the sole cause of the explosion. "It is wrong," they write,

to set up standards of absolute certainty from which to criticize the engineers. The development of an unknown technology like the Space Shuttle is always going to be subject to risk and uncertainties. It was recognized by the working engineers that, in the end, the amount of risk was something which could not be known for sure.

Part of the uncertainty regarding the O-rings was that NASA and Morton Thiokol could never determine exactly how large the gaps in the seals became in liftoff conditions, and thus how serious a danger they represented. Countless tests were run trying to answer that question, but they consistently produced inconsistent results. This was so in part because NASA's and Morton Thiokol's engineers couldn't agree on which measuring technique to trust. Each side, say Collins and Pinch, believed its methods were "more scientific," and therefore more reliable.

Charles Perrow writes that the inability to pinpoint the source of technical failures is especially common in what he calls "transformation" systems, such as rocket launches or nuclear power plants: the intricacy of the relationships between parts and processes ("tight coupling") makes it impossible to separate cause and effect. "Where chemical reactions, high temperature and pressure, or air, vapor or water turbulence [are] involved," he writes,

we cannot see what is going on or even, at times, understand the principles. In many transformation systems we generally know what works, but sometimes do not know why. These systems are particularly vulnerable to small failures that 'propagate' unexpectedly, due to complexity and tight coupling.

Roger Boisjoly's suspicion that cold weather was the source of the Challenger's O-ring problem was just that – a suspicion. As of the night before the Challenger launch, he had some evidence to back up his suspicion, but not enough to prove it. On the strength of Boisjoly's concerns, his superiors at Morton Thiokol initially recommended that the launch be delayed, but NASA's managers insisted on seeing data that quantified the risk. Unable to provide it, Morton Thiokol's managers reversed their recommendation, and the launch was approved. 

Roger Boisjoly

Underappreciated Contributing Dynamic #2: The Double Bind of the Human Factor

We know now that Morton Thiokol's managers should have supported their engineer's conclusions and held their ground, and that NASA, upon hearing there was a possibility of catastrophic failure in cold weather, should have exercised caution and postponed the launch. Again, all that is true, but it's not the whole truth. To pin the blame on irresolute and impatient managers is to underestimate the complexities of the human dynamics that led to the decision.

We like to think that sophisticated machines are reliable in part because they eliminate human error. In truth, complex technological systems always include a human component, and therein lies the dilemma. There's no shortage of examples, before and after Challenger, proving that the interaction of human beings and machines can end badly. It's also well known that we ask for trouble when we unleash powerful technologies without including human judgment in the mix. Human beings: can't live with 'em, can't live without 'em.

A subcategory of the human factor dilemma is what Charles Perrow calls the "double penalty" of high-risk systems. The complexity of those systems means that no single person can know all there is to know about the myriad elements that comprise them. At the same time, when the system is up and running, one central person needs to be in control. This is especially true in crisis situations, when the person in control is called upon to take, as Perrow puts it, "independent and sometimes quite creative action." Thus complex technological systems present us with built-in "organizational contradictions."

Communication issues can exacerbate those organizational contradictions. Middle-level managers, for example, may decide that it's unnecessary to pass relevant information up the chain of command. In Challenger's case, many of NASA's senior executives were unaware of the ongoing questions regarding the booster seals. It's likely no one told the astronauts, either. Opportunities for misunderstanding also arise from the manner in which information is offered and from the manner in which it's interpreted. On at least two occasions NASA managers shrugged off engineers' warnings about the risks of cold-weather launches because the engineers themselves didn't seem, as far as NASA's managers could tell, that alarmed about them.

Collins and Pinch stress that in many respects the arguments between NASA and Morton Thiokol the night before the Challenger launch were typical of the sorts of arguments engineers and their bosses (also engineers, usually) routinely engage in as they iron out problems in complex technological operations. And, as mentioned above, these arguments were continuations of discussions that NASA and Morton Thiokol had been having about the O-ring problem literally for years.

The longevity of those arguments actually became a barrier to their resolution. Some of the engineers at NASA and Morton Thiokol had invested so much time and energy in the O-rings that they developed a sort of psychological intimacy with them. Believing the problem fell within acceptable margins of risk, they grew comfortable wrestling with it. It was a problem they knew. This is an example of a phenomenon called "technological momentum." Simply put, habits of organizational thought and action become embedded and increasingly resistant to change. Devising an entirely new approach to the booster seals – one that would surely have had its own problems – was a step the shuttle engineers were reluctant to take, given the pressure they were under to move the project forward. Roger Boisjoly was able to look at the booster problem differently because he joined Morton Thiokol several years after the shuttle project had begun.

A major reason NASA's engineers were inclined to resist Morton Thiokol's recommendation to scrub the launch because of the cold was that temperature had never before been presented to them as a determinative factor in a launch/no-launch decision. This wasn't Roger Boisjoly's fault: the freezing temperatures on the eve of the launch were a fluke, and therefore presented conditions that hadn't been encountered before. Nonetheless, the novelty of Boisjoly's theory helped sway the consensus against him, as did his admitted lack of definitive data.

"What the people who had to make the difficult decision about the shuttle launch faced," Collins and Pinch write,

was something they were rather familiar with, dissenting engineering opinions. One opinion won and another lost, they looked at all the evidence they could, used their best technical standards and came up with a recommendation.

This may seem a cold assessment in light of what occurred, and Collins and Pinch aren't arguing that the decision the engineers made that night was correct. Obviously it wasn't. Still, the question must be asked: Isn't this exactly the sort of rational decision-making we generally prize in our scientists and technicians?

We understand that human judgment is fallible, but when complex technological systems go awry, we want to insist that it shouldn't be. Which is to wish for another sort of double bind: to have our cake and eat it too.


©Doug Hill, 2012

2 comments:

  1. What if I was willing to argue that the decision the NASA engineers made was right? Cold was not the cause of the accident. Roger was wrong, and he was never dedicated or smart enough to figure out the true cause. Why was it that, if the cold was the cause, a failure occurred at 15 inches of a 458-inch circumference? And more to the point, the LH Aft field joint, exposed to the same temperature and stresses, sealed completely?

    1. I wouldn't argue with you -- wouldn't be qualified to. I based my piece on the sources mentioned. I was more interested in the decision processes involved than in who was right or wrong. In fact one of the main things I find interesting about this tragic affair is how difficult it is to know who's right or wrong, in the moment and in retrospect.
