Planetary News: Mars Global Surveyor (2007)

Human and Spacecraft Errors Together Doomed Mars Global Surveyor

By Emily Lakdawalla
April 13, 2007
Mars Global Surveyor
Credit: NASA / JPL (art by Corby Waste)

A preliminary report released today by an internal review board has determined that the loss of Mars Global Surveyor probably had nothing to do with aging of the spacecraft hardware, but instead resulted from mistakes made by both the human operators and the spacecraft's onboard fault protection software.

According to the report, a chain of events beginning in September 2005 resulted in incorrect data being stored in two key parameters in the spacecraft's memory in June 2006.  As a result of reading that data in November 2006, the spacecraft attempted to over-rotate a solar panel, and incorrectly determined that the panel's failure to move was a result of a hardware problem.  Then, in attempting to recover from a problem that didn't actually exist, the spacecraft rotated into an orientation that caused one of the two batteries to be pointed at the Sun.  The battery overheated, and within twelve hours, the spacecraft drained the remaining battery and fell silent forever.  Now, all that NASA can do is apply the lessons learned from the review of the Mars Global Surveyor failure to its other missions.

The Human Mistakes

The problem began in September 2005 with an update of a parameter describing the pointing of Mars Global Surveyor's high-gain antenna (HGA) by the spacecraft operations team at Lockheed Martin.  The spacecraft had two redundant control systems, and parameters describing the hardware were supposed to be maintained identically on these two systems.  However, the HGA parameter was updated on the two systems at two different times by two different operators, with the result that the two control systems held slightly different values for it.  The difference was too slight to actually cause a problem for the spacecraft, but because the parameters were supposed to be maintained identically, the spacecraft operations team decided to correct the parameter so that it was consistent on both systems.

Unfortunately, in the effort to correct this minor error, the spacecraft operations team created a much more serious problem.  When the identical and correct HGA parameters were uploaded to the spacecraft in June 2006, the operations team incorrectly specified the location for the new parameter in the computer's memory.  Because the wrong memory location was specified, the new parameter was written over the end of one parameter and the beginning of a second parameter stored in onboard memory, corrupting both.  The two parameters were:

  • A number that described the HGA's pointing direction used during contingency operations; and
  • A number that specified the angle past which the solar arrays could not be moved.

Mars Global Surveyor's computer now held garbage values for both of these numbers.  Unbeknownst to the mission team, these incorrect parameters represented a hammer poised above the spacecraft.  The two incorrect parameters could cause Mars Global Surveyor to attempt to move its solar arrays past their design limits, and to point its antenna in the wrong direction during contingency operations.  From June to November 2006, the corrupt parameters sat in memory, undiscovered by either human operators or by the spacecraft.
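To see how a single write to the wrong address can damage two neighboring values at once, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the parameter names, the 8-byte width, the addresses, and the numbers are invented and do not reflect Mars Global Surveyor's actual memory map.

```python
import struct

# Hypothetical layout: three 8-byte floating-point parameters stored back to back.
# Names, offsets, and widths are invented for illustration only.
LAYOUT = {
    "hga_contingency_pointing": 0,
    "solar_array_soft_stop": 8,
    "spare": 16,
}
WIDTH = 8


def write_at(memory, address, value):
    """Write one little-endian double at an arbitrary byte address."""
    struct.pack_into("<d", memory, address, value)


def read_param(memory, name):
    """Read one parameter back from its slot in the layout."""
    return struct.unpack_from("<d", memory, LAYOUT[name])[0]


def changed_params(before, after):
    """Return the names of parameters whose raw bytes differ between two images."""
    return [
        name
        for name, offset in LAYOUT.items()
        if before[offset:offset + WIDTH] != after[offset:offset + WIDTH]
    ]


memory = bytearray(24)
write_at(memory, LAYOUT["hga_contingency_pointing"], 45.0)
write_at(memory, LAYOUT["solar_array_soft_stop"], 120.0)
snapshot = bytes(memory)

# The mistake: the new value lands 4 bytes past the start of the first slot,
# so it overwrites the end of one parameter and the beginning of the next.
write_at(memory, 4, 30.0)

corrupted = changed_params(snapshot, bytes(memory))
```

In this toy layout, `corrupted` names both of the first two parameters while the third slot is untouched, which mirrors the kind of damage the review board described.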

The Spacecraft's Mistakes

On November 2, 2006, the hammer fell.  Just before a routine communications session between Mars Global Surveyor and the Deep Space Network, a command was sent to the spacecraft to change the orientation of the solar arrays, in a routine (though infrequently called) procedure to prevent thermal wear and tear on one of the spacecraft's components.  But when the communications session began, the spacecraft reported many alarms.  Telemetry indicated that one of the motors driving one solar array was stuck, and that the spacecraft had switched to backup systems.  However, all indications were that, although a serious hardware fault had occurred, the spacecraft was coping with the problems safely and correctly.  The spacecraft operations team, working on the information that had been transmitted by Mars Global Surveyor about the fault and its responses to the fault, quickly went to work to analyze the problem.  The communications session ended when Mars Global Surveyor passed behind Mars as seen from Earth.

That 13-minute communications session was the last that Earth would ever have with Mars Global Surveyor.  The review board, led by Dolly Perkins, Deputy Director-Technical of Goddard Space Flight Center, reported the likely sequence of events that followed this communications session during a press conference held at NASA Headquarters today.  According to the report, Mars Global Surveyor was not correct in its assessment of its own health.  In fact, the motor reported as "stuck" had been working correctly, but due to the garbage parameter in the computer memory, the spacecraft had been attempting to move the solar array past its stopping point.  The spacecraft incorrectly interpreted its inability to move the solar array to be a problem with a stuck motor.  It quit commanding this motor and entered contingency mode, during which it rotated its solar panels to face the Sun.
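The misdiagnosis described above comes down to treating "the array did not reach the commanded angle" as proof of a stuck motor. A minimal sketch of a check that separates the two cases follows. This is a hypothetical illustration, not Mars Global Surveyor's actual fault protection logic: the function name, the angles, the tolerance, and the assumption that the hard-stop angle is known independently of the corrupted parameter are all invented.

```python
def diagnose_array_fault(commanded_deg, measured_deg, hard_stop_deg, tolerance_deg=0.5):
    """Classify a solar array that may not have reached its commanded angle.

    hard_stop_deg is assumed to come from a trusted source (for example, a
    constant in the flight code) and not from the uploadable parameter table.
    """
    if abs(commanded_deg - measured_deg) <= tolerance_deg:
        return "nominal"
    at_hard_stop = abs(measured_deg - hard_stop_deg) <= tolerance_deg
    if commanded_deg > hard_stop_deg and at_hard_stop:
        # The array went as far as it physically can; the command itself is suspect.
        return "commanded past hard stop"
    return "possible stuck motor"


# A command beyond the stop is flagged as a bad command, not a hardware failure.
past_stop = diagnose_array_fault(commanded_deg=200.0, measured_deg=180.0, hard_stop_deg=180.0)
# An array halted well short of both the command and the stop still looks like hardware.
stuck = diagnose_array_fault(commanded_deg=100.0, measured_deg=60.0, hard_stop_deg=180.0)
```

The design point is only that the two outcomes call for different responses: a bad command suggests distrusting the stored parameters, while a stuck motor suggests switching hardware.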

But there was a serious problem. Mars Global Surveyor had not only pointed its solar panels at the Sun, it had also pointed one of its two batteries at the Sun.  Perkins said, "It oriented that way because it was trying to do two things: one is to maintain enough power to the spacecraft, and the other is to achieve communications with the Earth.  It toggled between those two modes.  The particular orientation in this case, when it was trying to retain power, exposed the battery to the Sun.  It's because of how the onboard fault protection handled the event."

The battery was not designed to be pointed at the Sun; it overheated.  The spacecraft detected this condition, but was not programmed to correctly interpret its cause.  Instead, the spacecraft computer's fault protection software determined that the battery was merely overcharged.  Instead of turning to take the battery out of direct sunlight, it stopped charging the battery at all.  As a result, only one of the two batteries was being charged, and one battery was not enough to maintain power to the spacecraft.

In the middle of this sequence of events, the other garbage parameter in the spacecraft's memory reared its ugly head.  Before entering contingency mode, the HGA had been correctly pointed to communicate with Earth.  But as part of the spacecraft's contingency operations, it read the bad HGA parameter from memory, and turned the HGA away from Earth -- in fact, the HGA was most likely pointed not at Earth, but at the spacecraft itself.  That error made communications with the ailing spacecraft impossible.

Mars Global Surveyor was still alive, but operators were powerless to rescue it.  "When [the operators] made the contact with the spacecraft and they received the alarm, they knew something happened, but on the other hand the telemetry was still reporting that the spacecraft was stable and was operating nominally.  That kind of thing had happened before," Perkins said.  "It was a very short contact, a total of 13 minutes, so there wasn't sufficient time to react at that time.  Subsequently, they did not have communications, so there was no ability to react afterward."

Quickly and inexorably, Mars Global Surveyor depleted both batteries, most likely within five or six orbits (ten to twelve hours) after the initial fault.  Without power, Mars Global Surveyor ceased to function after ten long years in space.  It is now silent and unrecoverable, still circling Mars every two hours.

Lessons Learned

The review board found that all of the procedures that had been in place to run the Mars Global Surveyor mission so successfully for so long had been followed correctly.  Unfortunately, these procedures could not -- and clearly did not -- catch the memory-writing error that had been made by spacecraft operators in June 2006, and could not predict the unusual combination of circumstances that eventually led to the loss of the spacecraft.  "The particular procedures that were in place didn't have as much rigor in them as you could have expected," Perkins said. "For example, they didn't require people to go back to this very basic and not-always-easy-to-interpret set of tables that show everything that's in the memory on the spacecraft to verify what was going on.  There wasn't any obvious thing between June and November that would have led someone to discover [the faulty parameters].  Among our findings is that…there were no systems in place that would have discovered it and brought it to the attention of the operations team."
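The kind of check Perkins describes as missing, going back to the memory tables to verify what is actually stored, can be sketched as a comparison between the values the ground team expects and the values read back from the spacecraft. The sketch below is a hypothetical illustration in Python; the function name, parameter names, and numbers are assumptions and are not taken from the mission's ground software.

```python
def find_mismatches(expected, readback, tolerance=0.0):
    """Compare expected parameter values against a readback of onboard memory.

    Returns {name: (expected_value, readback_value)} for every parameter that is
    missing from the readback or differs by more than the tolerance.
    """
    mismatches = {}
    for name, want in expected.items():
        got = readback.get(name)
        # The "not (... <= ...)" form also flags NaN, which fails every comparison.
        if got is None or not (abs(got - want) <= tolerance):
            mismatches[name] = (want, got)
    return mismatches


# Invented values: what the ground believes is stored versus what a dump shows.
expected = {"hga_contingency_pointing": 45.0, "solar_array_soft_stop": 120.0, "spare": 0.0}
readback = {"hga_contingency_pointing": 0.0, "solar_array_soft_stop": 120.0000004, "spare": 0.0}

report = find_mismatches(expected, readback)
```

The same comparison could be run between the two redundant control systems, since their parameters were supposed to be identical.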

So, ultimately, the cause of the loss of Mars Global Surveyor was procedures that were not thorough enough to catch this particular problem.  In addition, the spacecraft's onboard fault protection software could not detect and fix the memory problem, or recognize that its battery was overheating because of its orientation to the Sun, and the spacecraft did not send enough information to Earth for operators to diagnose and correct the problems in time.

Spacecraft with long lives experience additional risk not only because of aging hardware, but because of loss of personnel.  Mars Global Surveyor had just entered its fourth extended mission.  Extended missions are typically run with fewer staff than primary missions.  This is possible in part because of experience derived from operating the spacecraft for so many years.  However, when experienced personnel move away from the mission to new missions, that experience is lost, and new workers must be trained.  In addition, the spacecraft themselves have idiosyncrasies that come with age.

The main lessons from the loss of Mars Global Surveyor thus have to do with the safe operation of missions that have long outlasted their originally intended lifetime, Perkins said.  "Over time, as a mission is aging, it is beneficial to step back and look at the current state of the spacecraft from end to end, to see what risks have been induced over time just by the aging of the spacecraft and what this might mean in terms of operational changes that ought to be made in terms of managing the spacecraft.  We found that these were not always completely assessed.  We found that the training for the online operators, and system engineers, and key people in the mission was excellent, the methodology was excellent, but it was not uniformly applied to some of the offline people that supported the spacecraft."

New wind streaks forming in Gusev crater
This animation flickers between two MOC wide-angle camera images of Gusev Crater, Spirit's landing site. The two images were taken on March 5 and March 12, 2005. In the week between these two images, a prominent new set of wind streaks formed within Gusev crater. Ongoing and repetitive study of sites like these is among the things lost with Mars Global Surveyor. Credit: NASA / JPL / MSSS / Daniel Crotty

What lessons does the loss of Mars Global Surveyor offer to other missions?  As far as missions to Mars are concerned, said the Jet Propulsion Laboratory's Mars Exploration Program Manager Fuk Li, "Some of the actions that we will take will be technical in nature, like operational procedures.  But some of the actions that we will take will be more cultural in nature, to ensure that over these long operational lifetimes of some of these missions, that the team can stay fresh, can be focused, and can be vigilant to make sure that the mission can continue to operate successfully."

Doug McCuistion, Mars Exploration Program Manager for NASA Headquarters, said that the specific lessons learned from the review of Mars Global Surveyor's failure would be disseminated widely, not only to Mars missions but to all of NASA's other science missions.  Furthermore, he said, they would seek improved procedures from outside NASA's Science Mission Directorate.  "These kinds of extended operations are something that, for example, the Space Station folks at Johnson [Space Center] do regularly.  So we're going to learn from them."

The tragic loss of Mars Global Surveyor reminds us that computers are no better than their programmers.  And humans are prone to error.  No matter how many procedures are in place to protect our spacecraft, there will always be unanticipated sources of error, and sequences of events that can end a spacecraft's life.  Mars Global Surveyor has left behind a rich legacy, nearly a decade's worth of data from Mars.  We'll never know how much greater that legacy could have been.
