The Planetary Society Blog

By Emily Lakdawalla


Spirit Sol 18 Anomaly

Sep. 22, 2006 | 09:40 PDT | 16:40 UTC

by Mark Adler

In a previous blog entry I said I'd talk about what happened on Spirit about a week after the President announced the Vision for Space Exploration. So here it is: a little window into what it's like to operate a priceless national asset.

It was January 21st, 2004, 18 Martian days (sols) since Spirit landed successfully on Mars and about a week since Spirit successfully drove off of the lander onto the surface of Mars. Everything was going so incredibly well, we couldn't believe it. It's really odd to have the rover work so much better on Mars than it ever did in test. Spirit was engaged in real geological exploration on the surface of an alien world. We figured we were the luckiest people on Earth.

Well, our luck was about to change.

Jennifer Trosper and I took turns as the tactical mission manager for Spirit. Jennifer was on duty for Sol 18, and it was an off day for me. I came to JPL anyway around noon for an interview for a documentary. As I was coming out of the interview, I ran into Steve Squyres, the Principal Investigator for MER. He was going in for his interview appointment. As he saw me coming out, Steve said, "Have you heard about Spirit?"

Just from the question and Steve's serious tone of voice, I instantly awakened from my partially sleep-deprived state and my eyes flew wide open. "What do you mean," I asked, "'Have I heard about Spirit?'" Steve said that we did not get any signal from Spirit when we expected to, either from a high-gain antenna communication session direct to Earth, or from the subsequent relay communication session to the Odyssey orbiter.

Oh boy.

If only the high-gain session were missing in action, that could easily be due to weather, Deep Space Network problems, lots of things. It turns out that communicating over a few hundred million miles isn't easy, and you often have problems. But our experience up to then was that the relay sessions, only a few hundred miles to the orbiter, always worked. The Odyssey orbiter itself was working fine. For the relay to also fail right after the high-gain failure gave a very strong indication that the problem was in the rover, and that it was serious.

Space missions are risky. We're used to that. We especially think a lot about the riskiest parts. For MER, the biggest risk was entry, descent, and landing, or the "six minutes of terror" as we would call it. The next most risky was the launch from Earth, and finally came the post-landing events: the rover unfolding and cable cutting, followed by driving off the lander.

Launch, landing, and egress. All of them nail-biters.

We had made it through that with Spirit. All the risky stuff was behind us, we figured. From here on, with proper care and attention of course, there were no more big risks. It should all be gravy and smooth sailing. Which made this all the more alarming. What the heck happened?

I immediately headed over to the operations area, in which I and many others spent the next three days almost continuously. There were no off days at that point. Jennifer was the planning mission manager and I was the tactical mission manager for the next three sols. Planning is what you do during the Martian night, and tactical operations occur when the rover is awake during the day. So Jennifer and her team were figuring out what to do, and I and my team were doing it. Or at least trying to.

On Sol 19 we simply attempted to communicate with Spirit to see if we could get some data back. Before the tactical shift started, per tradition, I played a song in the mission control room related to the day's activities. For Sol 19 I played "S.O.S.", by Abba. And that's about all we got out of Spirit that day. Several attempts at communication that all would have worked fine normally instead all failed, except for one commanded beep. A beep is simply the rover turning on the radio carrier for five minutes, which we detect at Earth. While there is no data on the carrier, the carrier itself provided one vital piece of information: Spirit was still there. It wasn't completely dead. Though we didn't get any data, we called that a good day. I ended my mission manager report with the optimistic plan: "In the long term, restore the state of the vehicle, diagnose and correct what happened, and return to normal operations." In fact, I ended my reports for Sol 20 and 21 with the same words.

On Sol 20, we made an even more concerted attempt to coax data out of the rover. Without the data, we'd have no idea what to do, or try to do, to recover normal operations. It took many attempts and variations on the approach, but we managed to get actual data modulated on the carrier. Much of the data was repeating gibberish, itself a mystery, but we did get a complete enough packet to get some health information on Spirit. Getting the data was fantastic, but the data itself painted a bleak picture.

What we saw was much higher internal temperatures on the rover, and a much lower battery voltage than we expected. Those two together were a clear indication that the rover was not going to sleep like it was supposed to. Normally the rover computer is on only five or six hours out of the day. That conserves the precious solar energy stored in the battery, and also keeps the rover from overheating.
Well, Spirit wasn't sleeping for long enough, or possibly not sleeping at all.

What we had on our hands was one sick rover. Spirit had insomnia, a fever, was getting weaker all the time, was babbling incoherently, and was largely unresponsive to commands.

Not good. We had one rover on Mars dying, and two days later Opportunity was going to go through its risky entry, descent, and landing. Before the weekend we could easily go from having two Mars rovers to having none at all.

While we still had time to command Spirit on Sol 20, before the Earth set at the landing site, our priority was to get the rover to go to sleep. We planned the shutdown command to hit the rover while it was in a communications session. That way we could see the session end prematurely, which would verify that it got the command. So we sent the SHUTDWN_DMT_TIL command. That stands for shutdown dammit until, which is followed by the time we want the rover to wake up. The "dammit" means an immediate emergency shutdown without regard for whatever activities happen to be running on the rover. (There was a little bit of humor in the naming of the commands.)

We sent the shutdown dammit, and sure enough, the communications session ended as soon as the command hit the rover. So it started the shutdown. Whew. We got Spirit to go to sleep. To verify this, we sent a beep command. We shouldn't get a beep in return, since a sleeping rover can't receive or respond to a command.

We got the beep.

What the ...? Spirit was supposed to go to sleep! Well, it didn't. Spirit was going to burn the midnight oil for yet another night. The Earth set at Gusev Crater, and we had to wait until the next day to try more commanding. In the meantime, Spirit's battery would continue to head downhill, and the electronics would get even hotter inside. We were running out of time.

Sol 21. We had a plan. The prevailing theory, at least the one we could do something about, was that the rover computer was in a "reboot loop". The response of the software when it encounters a problem it can't solve is to reboot. Just like you would do with your computer when it seizes up. Since no one is there to press the reset button, the rover does it automatically. However, if the software encounters a problem while rebooting, then it's stuck rebooting forever. The designers had the foresight to put a delay in the reboot cycle so that there was some time to try to talk to the rover between reboots. That would explain the intermittent commandability, since whether the rover responded to a command would depend on when the command hit in the reboot cycle.

The idea behind the reboot is that everything in the software starts over from square one, so that whatever caused the problem in the first place should be gone the second time around. But in this case, the problem persisted. So Spirit was remembering something between reboots that was the cause of the problem. That pointed to either the flash memory (like in your digital camera), another smaller piece of memory that retains its contents called the EEPROM, or a hardware failure. The flash memory on the rover is used like the hard drive on your computer -- the file system is kept there.

Again, the brilliant designers had built in a back-door for us. There was a way to get the rover to reboot without ever looking at the file system on the flash. The radio that receives commands from Earth has built into it the ability to decode a few commands all by itself, called hardware commands. It doesn't need the computer at all to figure out those commands and execute them. One of those commands sets a flag to tell the computer to not use the flash file system when booting. Another one of those commands hits the reset button on the computer to force a reboot.
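
For readers who think in code, the reboot loop and the back door can be sketched as a toy model. This is only an illustration under stated assumptions, not the flight software: the function names, the `skip_flash` flag, and the attempt counts are all made up for the example.

```python
class BootError(Exception):
    """Raised when the simulated boot sequence cannot complete."""


def boot(flash_ok, skip_flash):
    """Simulate one boot attempt (hypothetical model, not flight code).

    The flash file system is only touched when skip_flash is False.
    """
    if not skip_flash and not flash_ok:
        raise BootError("could not read the flash file system")
    return "booted without flash" if skip_flash else "normal boot"


def run_reboot_cycle(flash_ok, skip_flash_at=None, max_attempts=5):
    """Keep rebooting until a boot succeeds or attempts run out.

    skip_flash_at is the attempt number at which an assumed hardware
    command sets the skip-flash flag; None means it is never sent.
    Returns a log of (attempt, outcome) tuples.
    """
    log = []
    skip_flash = False
    for attempt in range(1, max_attempts + 1):
        # The flag is set by the radio hardware, independent of the computer.
        if skip_flash_at is not None and attempt >= skip_flash_at:
            skip_flash = True
        try:
            outcome = boot(flash_ok, skip_flash)
        except BootError:
            # The problem lives in persistent storage, so rebooting
            # alone never clears it.
            log.append((attempt, "reboot"))
            continue
        log.append((attempt, outcome))
        break
    return log
```

With a bad file system and no flag, every attempt ends in another reboot. Once the flag is set, the next boot skips the flash and succeeds.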

So that's what we tried on Sol 21. It took a few attempts, but we eventually and victoriously rebooted Spirit into a somewhat sane state where it was responsive to commands and not babbling. What a relief that was. We retrieved some power history data for the last few days, removed some planned relay communication windows, and finally gave Spirit a long-deserved and badly needed sleep. This time it worked.

Now we had the secret sauce to get Spirit working. The rover would still wake up every morning into the reboot cycle, but we could quickly send up the necessary commands to boot without the flash file system. Which is what we did for several more days. We had won the race with time, and now we could carefully and methodically figure out what happened, fix it, and go on with the mission.

So I went home and went to sleep. Very easily I might add. My alarm went off about five hours later. Why? So that I could go back to JPL for the Opportunity landing that night! Hours after we got Spirit under control, Opportunity came screaming into the Martian atmosphere at 12,000 mph. Opportunity landed successfully, and we were back to confidently having two rovers. Now with both of them safely on Mars. Wow, what a ride.

The end of Sol 21 was the turning point in the recovery. It still took another two weeks to complete the diagnosis, fix the problem (which involved effectively reformatting the hard drive, that is, the flash), and restore Spirit to full science operations. As we were recovering some of the data taken before the anomaly, we got back this beautiful color picture of the U.S. flag on the Rock Abrasion Tool (RAT) on the arm. That flag was on a metal shroud that was made from scorched and torn remnants of the World Trade Center towers. The RAT was designed and built in Manhattan, blocks from the WTC site. We put that picture of the RAT with the stars and stripes on the big screens in our mission control room, and I played our national anthem. Everyone stood with hand on heart. It was a good moment.

U.S. Flag on RAT
Credit: NASA / JPL / Cornell


Spirit has operated just beautifully ever since, aside from some recent signs of old age, like a bum wheel motor. As I write this, Spirit is on Sol 967. Nine hundred and sixty seven?! Wait, that can't be right. Let me check that ... Yep, it's right. Incredible.

You may be wondering what caused the Sol 18 problem in the first place. We eventually figured out that there was simply an error in the software on the rover that we didn't catch in test. Some memory was getting used up more and more each day as we collected data. On Sol 18 that fixed block of memory got filled up, and the boot process failed while trying to read the file system. We actually did think about the sort of failures that can occur in the accumulation of many sols of operations. To scare out those sorts of problems, we ran a 10-sol test before landing. But we didn't run an 18-sol test. Well, not until Spirit ran the test for us on Mars, that is. We will probably run into a bad software problem some other day on some other spacecraft, but I can guarantee you, we won't let that kind of bug get us again.
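
Why a 10-sol test can pass while Sol 18 fails is easy to see in a toy model. The sketch below is only an illustration: the capacity and the per-sol usage are invented numbers chosen so the block overflows on sol 18, not the real figures from the rover.

```python
def first_failing_sol(capacity, used_per_sol, max_sols):
    """Return the first sol on which cumulative usage exceeds capacity.

    Returns None if the fixed block never overflows within max_sols.
    All quantities are in arbitrary, made-up units.
    """
    used = 0
    for sol in range(1, max_sols + 1):
        used += used_per_sol  # memory is consumed each sol and never released
        if used > capacity:
            return sol
    return None


# Assumed numbers: a block of 170 units, 10 units consumed per sol.
ten_sol_test = first_failing_sol(capacity=170, used_per_sol=10, max_sols=10)
on_mars = first_failing_sol(capacity=170, used_per_sol=10, max_sols=30)
```

In this model the 10-sol test finishes without ever seeing the failure, while a longer run hits it on sol 18. A test has to run at least as long as the slowest accumulation it is meant to catch.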


