Meta Failure

A meta-failure is what happens when things go wrong because a set of deep underlying preconditions makes an accident either likely or inevitable. These sorts of accidents do not have simple causes like corporate greed, negligent employees, or mechanical failure. In these sorts of accidents, looking at the shallow causes is not enough; we must take a deeper look. And, we must remember that the shallow causes are often used to provide cover for mismanagement. Understand, meta-failure accidents are not caused by simple things like the unnoticed mechanical fault in the subway car, or the negligent air traffic controller. Instead, meta-failures have multiple causes that may, in themselves, seem outrageous, and innocent perpetrators who were only doing their jobs. So, the unnoticed mechanical fault in the subway car might have been noticed a year before, but went unfixed because the transit authority could not get enough money from the state legislature for maintenance, or permission from the city council to raise fares. Or, perhaps the negligent air traffic controller had not been allowed adequate sleep for the last three months because of the work schedule put out by FAA managers in Washington, in response to training requirements imposed by a new DOT regulation, in response to a statute passed by Congress. The causes of these accidents can appear to be unrelated or weird, but linked together they can result in tragedy.

The Space Shuttle Challenger
The loss of the space shuttle Challenger is a good example of meta-failure. The Challenger was destroyed in 1986 because an o-ring seal failed in one of its solid fuel boosters. This allowed flames from the booster to burn into the shuttle’s main fuel tank. The tank exploded shortly thereafter. All seven crew members on board the shuttle were killed. Basically, four factors played a role in causing the accident.

First, were the higher leaders at NASA aware of the costs of their decisions?
The leadership at NASA had decided on a tough launch schedule for the shuttles, perhaps in response to political pressures, and perhaps without considering the real costs of the schedule. Understand, in any bureaucracy, requirements and attaboys come down, and “yes, sirs” and flattery go up. People lower down the ladder adjust data to please their superiors. And sometimes, the people at the top let their subordinates know that they want to be pleased. After several levels of this, data can be remarkably distorted. So, the real costs of the schedule may not have been considered. Also, there might have been adverse consequences for anyone bringing up issues that could potentially set back the schedule. Remember, in any bureaucracy, people who won't “go with the program” tend to be regarded as “strange,” or even as “disloyal.” In any case, the investigation of the accident revealed that the schedule had put severe strains on the shuttle program.

Second, was anyone keeping track of safety data?
Several years before the accident, an investigation of recovered solid fuel boosters found that the o-ring seals used on the boosters tended to be burned when the shuttles were launched in cool (less than 50°F) weather. Understand, the solid fuel boosters used on the shuttle are composed of several sections connected together. The sections are sealed by o-rings in their joints to prevent the hot gases inside the booster from leaking out while the motors burn. After the motors burn out, they drop off the Shuttle and are recovered by parachute. In the investigation, the researchers found that the o-rings were leaking because they lost resiliency when cold. In some cases (prior to Challenger’s ill-fated launch) the o-rings had burnt through during liftoff. However, in cases where the seals had failed, it happened late enough in the solid rocket’s “burn” to not endanger the missions. On the day of the launch of the Challenger, apparently no one in a position to stop the launch was aware of this research. Someone should have been assigned to collect and analyze this data.

Third, dangerous political snags
Part of Challenger’s last mission was overtly political: the launch of Christa McAuliffe (the first teacher in space) into orbit. So, there were politicians and newspeople waiting to chat with Christa McAuliffe once she was in orbit. And, of course, there were administrators at NASA determined to make sure this happened without any glitches. Delaying Mrs. McAuliffe’s flight would have had negative political consequences for managers at NASA. Understand, there was nothing new about this. For example, the worst accident in the history of space exploration occurred on October 24, 1960. About 100 people were killed in the launch pad explosion of a Soviet rocket (the Soviet Union never released the details; this is based on James Oberg’s books, “Red Star in Orbit” and “Uncovering Soviet Disasters”). What happened was that a glitch in the rocket caused an abort just before launch. Safety regulations required that the rocket be de-fueled before it could be (safely) repaired. But that would have put the launch way behind schedule. The manager at the site, Field Marshal Mitrofan Nedelin, decided to gamble on leaving the volatile fuel in the rocket while it was being repaired, hoping this would prevent any serious delay. He lost (both the rocket and his life). Why did he decide to put so many lives at risk? Maybe he decided to do this rather than risk the political consequences of failure. A few weeks earlier, Nikita Khrushchev, then the leader of the Soviet Union, gave a speech at the United Nations hinting at new Soviet space exploits in the near future. Back then, satellites were the latest thing in high technology, and the Soviet Union was out to show the world that it was at the forefront of this technology. The plan was to launch two Mars probes shortly after the speech. Both probes failed shortly after launch due to technical faults. Naturally, there were political repercussions.
Field Marshal Nedelin might have been worried about the official reaction to his launch delay, coming as it did on top of two major failures. Could anything have been done to prevent disaster? Yes (they could have been more careful), and no. Unfortunately, as long as pioneering space missions are played out on television, they will be the stuff of politics. As long as the funding comes from politicians, just “saying no” to politics is not possible. So, Christa McAuliffe was inevitable. The accident that took her life wasn’t.

Fourth, the launch was outside of the engineering specifications
The last launch of the Challenger happened during one of the worst cold snaps to hit Florida on record. I don’t think anyone anticipated the shuttle would be launched in weather that cold. After all, the launch pad was located in Florida, and the only alternate was Vandenberg in southern California. Now, some of the engineers at NASA and at Morton Thiokol (the contractor who built the solid fuel boosters, now called Thiokol due to a corporate split up) knew there was a danger in launching the shuttle in weather so cold. But no one told the managers at NASA (perhaps because they had to “go with the program”). Also, it might have been that the leaders at NASA had let it be known that they didn't want to hear bad news. But, that aside, did anyone realize that there was a risk simply because the weather was so cold, and that launching in weather that cold had never been anticipated? In the end, Christa McAuliffe (the first teacher in space) had to be launched on time. So, they crossed their fingers and let the shuttle be launched, and seven people died in an accident that was totally preventable.

What could have been done to save the Challenger?
There was not any one thing that caused the Challenger accident. The accident was caused by a combination of factors that were in place years before the failed launch. While in the very short run the accident can be blamed on defective o-ring seals, in the long run it was caused by a series of meta-failures that all came together on the day of the launch. But, even as late as the day of the launch, there were things that could have saved the shuttle Challenger, if only someone had been able to convince the managers at NASA that there was a danger in launching the Shuttle.

In the short term, emergency action
The launch of the shuttle could have been delayed, and a plausible excuse invented for the news media. Perhaps the icicles hanging off the shuttle, tower, gantries, and just about everything else (this was one of the worst freezes on record) could have been invoked as a potential danger. For example, what if ice froze to the ceramic tiles (used as a heat shield during reentry) and caused some of them to break off during launch? Actually, this was a serious possibility, and there were other dangers as well. Also, this would have had the right impact on the news reporters and politicians who were waiting to chat with Christa McAuliffe in orbit. Or, someone could have brought up, anonymously, the possibility of rainwater freezing inside the joints of the solid rockets. While in engineering terms this may not have been a serious possibility, it would have provided a simple reason for delaying the launch, while heading off an embarrassing deeper look. And, a simple reason could have been given, “While we think this hypothesis is totally implausible, we have to investigate it just to be safe.” Finally, they could have simply told the truth, that there was a problem with the temperature response of the o-rings. And, a simple reason could have been given to explain why the o-ring problem had been ignored, “We never expected to have a shuttle launch in weather this cold.” The problem was that the truth might have embarrassed NASA. So, finding a deceptive explanation might have been more bureaucratically feasible. Note, I think the morality of deception is suspect in all cases. But in this case, because of the culture of the bureaucracy, deception might have been necessary to save lives.

In the medium term, an engineering fix
The faulty o-ring seals could have been fixed years before the accident, if only someone had mentioned the problem (without appearing “strange”) to a manager with enough seniority to get money allocated for a fix. The modifications to the seals could have been done quietly, and relatively inexpensively. And, if anyone asked, the work could have been explained as, “Making an ‘ultrasafe’ shuttle orbiter even safer.”

In the long run, design and policy
Twenty years before the accident, Morton Thiokol could have been asked to build a plant either in Florida (next to Kennedy Space Center), or at a site that allowed the boosters to be shipped by barge. Why? The joints and the o-ring seals were in the solid fuel boosters in the first place because the manufacturing plant was located in Utah. Because of the plant’s location, the boosters had to be broken down into little sections and shipped to Florida by rail, then assembled prior to launch. If the Thiokol plant had been located on, say, the Missouri River, then the boosters could have been built in big pieces and shipped by barge to the Kennedy Space Center (via the Mississippi and the Intracoastal Waterway). Also, there are safety issues in shipping by rail. Sometimes vagrants ride on railroad cars. Sometimes railroad shipments are vandalized. And, sometimes fools shoot guns at trains. These sorts of things are much less likely to happen to a barge in the middle of a river. Ultimately, the reason a new plant was not built was budgetary. Congress had limited the funds NASA had available for the Shuttle. And, Morton Thiokol was, after all, the low bid contractor (and maybe the only contractor capable of building those boosters). A properly located shuttle plant would have been expensive, and would have duplicated the other plants Thiokol already had for making military rockets. Understand, most of Thiokol’s rocket business was (and is) making military boosters. So, budgetary logic prevailed, rather than engineering sense.

Conclusion
The Challenger accident cannot be blamed on “pilot error,” or other forms of human error, or software failure, or corrupt contractors and politicians. It was caused by a series of meta-failures that came together in a way that made the accident inevitable. There was that tough launch schedule that had to be maintained. There were the faulty o-ring seals that could have been fixed years before. And, there was Christa McAuliffe, the first teacher in space, who had to be launched on time. So, if you ask, “Who is to blame for this accident?” it’s hard to say. Choose your suspect: everyone, no one, the system. Bureaucracy specializes in distributing responsibility this way. In the last analysis, long term political and budgetary decisions, in some cases made decades before the accident, and short term bureaucratic behaviors made the shuttle accident inevitable.

Preventing this in the future
I think there is something that can be done to prevent these sorts of accidents in the future. NASA should create a group of engineers to independently compile and study safety-related data. The engineers should have the political clout, independence, and access to get their jobs done quietly. They should have the power to independently review plans and programs. They should only be accountable to NASA’s chief administrator. And, they should have the power to intervene when they find potential problems. Bureaucracy being what it is, some at NASA will object. Understand, in the government your career can be blown at any seam, even if you are not really responsible (it’s called the zero defects mentality). So, people in government have a natural fear of outsiders coming in and nosing around. Understand, a bureaucracy is a tight community, like a tribe, so there is a collective psychology at work. If something is a threat to you, it’s a threat to me. If you mess up, you bring shame on us all. Bureaucracy is an emotional amplifier; things like shame, paranoia, or prudery can become dominant motivators. To make things worse, bureaucracy is a regulator, and tends to see itself as a collective parent. So, ideally, the bureaucracy should be an omniscient (all knowing) parent, and the regulatee should behave like a three-year-old child. This is why bureaucracies have a hard time regulating themselves: self-regulation violates the unwritten rules that guide all human communities. How can you, as a member of the community, treat other senior members like three-year-olds? It just isn’t done. And, admitting a mistake is admitting that the system (your tribe) is not omniscient (that brings everyone down). Now, NASA launches space missions that cost the taxpayers hundreds of millions of dollars. And, NASA is its own safety regulator. And, that is the problem. That’s why there must be independent safety oversight.
And, that’s why faults have to be found and fixed quietly, if possible. But, whether these things are found and fixed quietly or not is a secondary issue. These accidents are very expensive, economically, politically, and sometimes in human lives. Preventing these accidents must be a primary goal at NASA.

I hope you enjoyed reading this.

The Utah Rocket Factory

Thiokol built their Utah plant decades before there was a shuttle program. When Thiokol built the plant, they probably had no idea that their solid fuel rocket technology would be playing a major role in NASA’s manned space program. At that time, the primary customer for Thiokol’s solid fuel rockets was the military. As far as I can tell, the plant’s first products were boosters for ICBMs, in the late 1950s and early 1960s. The plant was probably sited in Utah to be close to the military depots and test ranges in the southwest.
 


Was Thiokol the Only Possible Contractor?

If this is true in the case of Morton Thiokol, then perhaps I should say that they were the “no bid” contractors.

The Pentagon should be famous as a creator of monopolies. The reasons are simple. First, military technology is so expensive, and capital intensive, that only the richest firms can afford to compete. Second, this technology is so esoteric that only a few people really understand it. Third, the military’s contracting practices are so competitive that the losers are driven out of the business entirely. Now, if you have the plant, and have the people who can do the technology, and have won the last few rounds of bidding, then when the military wants something, they come knocking on your door (and pay your price). That’s right. If they need the technology, then they need you. Why? Because the military’s system of “competitive” bidding created a situation where there is no effective bidding at all. Because the system let you own the technology. Now, the government may try to hold a competition for the next big contract, but it will be a sham. Why? Because your competitors, if they win, still have to build a plant, and find experienced people, and negotiate with you for the right to use your patents. So, even if you lose, you win. And, if something goes wrong with the technology you sold the military, there isn’t an awful lot they can do about it (as long as the defect isn’t too flagrant). Why? Because they still need you (and, of course, have to pay your price) if they want to replace the broken stuff you sold them. There are two possible solutions to this problem. First, Congress could nationalize your company (and pay you and your stockholders). Or, Congress could change the way contracting is done for big ticket technology. The contracting system could be redesigned so that no one loses too much. That is, let the winners become the main contractors, and the losers become the subcontractors. That way, monopoly can be prevented because several firms would always have a hand in the technology.
Also, firms that do business with the Pentagon might be required to participate in a patent sharing system with other Pentagon contractors. That is, firm A gets to use firm B’s patents (for military work only) provided it pays a standard percentage royalty. That way, firm B is compensated for its creative work, firm A gets to use the patents for a reasonable rate, and monopolization is prevented. Some economic pundits might have fits with these proposals. But the issue here is simple: either have a contracting system that works at least halfway, or a system that doesn’t work at all. The choice is between a system that fosters creative competition, and a system that inevitably creates expensive monopolies of either the capitalist or the socialist variety.
 


The O-Ring Seals

The problem with the o-rings had to do with the original specifications. Apparently, no one thought the o-rings could get as cold as they did. Thiokol was building the rockets to spec. However, modifying the seals was a relatively inexpensive procedure. The moral issue emerges when we look at what the people did once they knew there was a potential problem. So, why did the managers at NASA ignore data that indicated that there was a problem? Did anyone tell them?
 


The Loss of the Mars Climate Orbiter

A good example of how a system of independent safety investigation might have prevented an accident can be found by looking at the loss of the Mars Climate Orbiter. The probe was lost due to a navigation error. The contractor, Lockheed Martin, was sending thruster data in pounds-force, while the navigation team at JPL had programmed their computers to receive data in newtons, and nobody noticed (why?). But, there is more to this story. I think the road to this failure really started when Congress passed the Omnibus Trade and Competitiveness Act of 1988. One of the provisions of this bill mandated that the Federal Government convert all of its business and contracting to metric units. The idea was that the Federal Government would force American businesses to convert to the metric system. Unfortunately, no one realized that this meant that most of the incredible costs of the conversion would be borne by the Federal Government, in the form of higher costs for everything. Congress ended the program when the bills and cost estimates started coming in (for more on this see Metric Conversion, Not?). Unfortunately, the program to build and launch Mars Climate Orbiter began just before the start of the conversion effort, and continued after its end. So, the program caught the full chaos imposed by the conversion efforts. And, when the probe was launched, there was a problem ticking away. Understand, the problem didn’t have to be ticking away; proper software control procedures at NASA would have caught it. And you cannot really blame Lockheed Martin; NASA never specified the units it wanted in the contract, and afterwards no one asked what units they were using. But the story didn’t end there. Later, during the flight to Mars, when the project engineers noticed that the navigation data was odd, they tried to tell the NASA managers that they thought something might be wrong because the course corrections were too big (compared to other missions).
The managers, apparently, dismissed their warnings because all the engineers had were strange data and hunches. The NASA managers assumed that all was well because nothing had gone obviously wrong. Understand, no one wants to send negativity up the chain of command; if you’re wrong, people will be calling you a fool, and if you’re right, then you’re the bringer of bad news (in either case, your status and your career might be damaged). Then, the real bad news came in, that the Mars Climate Orbiter was lost.

What if?
What if the engineers had someone to go to, anonymously, outside of management, with the power to act? What if someone had been in the position to independently look for possible problems in the months before the probe was lost? Perhaps the fact that the thruster data for the probe was being reported in pounds-force, while newtons were being used to calculate the course corrections, might have been noticed. Perhaps in time for something to be done. But, that would have required someone empowered to ask embarrassing questions.
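To make the unit problem concrete, here is a minimal sketch in Python of one way a receiving program could refuse to guess at units. It is only an illustration, not a description of the software actually used on the mission. The names Force and navigation_input are hypothetical, and the example value of 10 pounds-force is made up. The conversion factor (1 pound-force = 4.4482216152605 newtons) is the standard definition.

```python
from dataclasses import dataclass

# Conversion factor: 1 pound-force = 4.4482216152605 newtons (exact by definition).
NEWTONS_PER_POUND_FORCE = 4.4482216152605


@dataclass(frozen=True)
class Force:
    """A thrust value that carries its unit with it (hypothetical type)."""
    value: float
    unit: str  # "N" for newtons, "lbf" for pounds-force

    def to_newtons(self) -> float:
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * NEWTONS_PER_POUND_FORCE
        # An unlabeled or unknown unit is an error, not something to guess at.
        raise ValueError(f"unknown force unit: {self.unit!r}")


def navigation_input(reading: Force) -> float:
    """Hypothetical receiving side: always convert before use, never assume."""
    return reading.to_newtons()


if __name__ == "__main__":
    sent = Force(10.0, "lbf")         # made-up value, labeled by the sender
    correct = navigation_input(sent)  # about 44.48 N
    misread = sent.value              # the bare number, wrongly taken as newtons
    print(f"correct: {correct:.2f} N, misread: {misread:.2f} N")
    print(f"the misreading understates the force by a factor of {correct / misread:.2f}")
```

The point of the sketch is that a bare number read in the wrong unit is off by a factor of about 4.45, and nothing in the number itself reveals the mistake. Carrying the unit along with the value turns a silent error into one that either converts correctly or fails loudly.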
 


Sources

I am indebted to the following sources for inspiration and some of the material that appears in this article.

Books
Red Star in Orbit, by James Oberg, published by Random House in 1981
Uncovering Soviet Disasters, by James Oberg, published by Random House in 1988

Magazine Articles
“Why The Mars Probe Went Off Course,” by James Oberg, in Spectrum Magazine, December 1999

Internet
Mark Wade's Encyclopedia Astronautica, at http://www.rocketry.com/mwade/spaceflt.htm
 

December 3, 1999 - May 30, 2000


Copyright © 2000 by George A. Fisher