The Mars Pathfinder mission, while widely
proclaimed to be flawless, actually had a small
bug in the scheduling code which handled tasks
while the craft was operating on the surface of
Mars. This bug caused intermittent computer
resets, each of which resulted in the loss of
data.
The Pathfinder computer handled three tasks: a
high priority bus management thread which was
responsible for moving data on and off the
information bus (basically a shared memory area),
a medium priority communications thread, and a low
priority thread responsible for gathering
meteorological data and publishing it to the
information bus.
The low priority meteorological thread would
occasionally lock a mutex (mutual exclusion lock)
protecting access to the information bus so that
it could publish its data. The reset bug was the
result of a stall that occasionally occurred
when the communications task was scheduled during
the short interval in which the meteorological
thread held the information bus lock and the high
priority bus management thread was blocked waiting
for that lock. Since the
communications task was long running and had a
higher priority than the meteorological task, it
prevented the low priority task from running. But
that task still held the mutex, so the bus
management thread could not proceed either. After
the bus management task hadn't executed for a given
period of time, a watchdog timer would go off and
reset the computer.
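
The sequence of events above can be reproduced in a small tick-based
simulation. This is an illustrative Python model, not Pathfinder's
VxWorks code: the Task fields, the simulate function, the tick counts,
and the watchdog timeout are all assumptions chosen to make the effect
visible.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int    # larger number = higher priority
    release: int     # tick at which the task becomes ready
    work: int        # ticks of CPU time the task needs
    needs_bus: bool  # True if the task holds the bus mutex while it runs


def simulate(tasks, watched, watchdog_timeout, max_ticks=100):
    """Fixed-priority preemptive scheduling, one tick at a time.

    Returns (timeline, reset_tick). reset_tick is the tick at which the
    watchdog fired because `watched` was ready but had not run for
    `watchdog_timeout` consecutive ticks, or None if it never fired.
    """
    remaining = {t.name: t.work for t in tasks}
    mutex_owner = None
    starved = 0
    timeline = []
    for now in range(max_ticks):
        if all(r == 0 for r in remaining.values()):
            break
        ready = [t for t in tasks
                 if t.release <= now and remaining[t.name] > 0]
        # A task that needs the bus can run only if the mutex is free
        # or it already owns the mutex.
        runnable = [t for t in ready
                    if not t.needs_bus or mutex_owner in (None, t.name)]
        current = max(runnable, key=lambda t: t.priority, default=None)
        if current is None:
            timeline.append("idle")
        else:
            if current.needs_bus:
                mutex_owner = current.name
            remaining[current.name] -= 1
            if remaining[current.name] == 0 and mutex_owner == current.name:
                mutex_owner = None  # finished: release the mutex
            timeline.append(current.name)
        # Watchdog: count consecutive ticks the watched task was starved.
        watched_ready = any(t.name == watched for t in ready)
        if watched_ready and (current is None or current.name != watched):
            starved += 1
        else:
            starved = 0
        if starved >= watchdog_timeout:
            return timeline, now
    return timeline, None


met = Task("met", priority=1, release=0, work=3, needs_bus=True)
bus = Task("bus", priority=3, release=1, work=2, needs_bus=True)
comm = Task("comm", priority=2, release=2, work=10, needs_bus=False)

# Without the communications task, the bus task waits only briefly.
normal_timeline, normal_reset = simulate([met, bus], "bus", 5)

# With it, the meteorological task is preempted while holding the mutex.
inverted_timeline, inverted_reset = simulate([met, bus, comm], "bus", 5)
print(normal_timeline, normal_reset)
print(inverted_timeline, inverted_reset)
```

In the first run the bus task is delayed only until the meteorological
task finishes. In the second, the communications task takes the CPU
while the mutex is still held, and the simulated watchdog fires before
the bus task ever runs.
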
The Pathfinder computer was a victim of
priority inversion. Priority inversion occurs when
the execution of a high priority task is
prevented by a lower priority task. In Pathfinder's
case, the low priority meteorological task blocked
the high priority bus management task by holding
the mutex to the information bus, while the medium
priority communications task kept the meteorological
task from running long enough to release it.
The problem was identified and fixed by JPL
engineers who worked with a computer identical to
the one on Pathfinder.
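
The text above does not say what the fix was. Published accounts of the
incident describe it as turning on priority inheritance for the mutex,
so that a task holding the lock temporarily runs at the priority of the
highest priority task waiting for it. The following Python sketch
assumes that technique; the Task and InheritanceMutex classes and the
numeric priorities are hypothetical and do not reflect the VxWorks API.

```python
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority  # priority assigned at design time
        self.priority = priority       # effective priority seen by the scheduler


class InheritanceMutex:
    """Mutex that lends a blocked waiter's priority to the current owner."""

    def __init__(self):
        self.owner = None
        self.waiters = []

    def acquire(self, task):
        """Return True if `task` now owns the mutex, False if it must wait."""
        if self.owner is None:
            self.owner = task
            return True
        self.waiters.append(task)
        # Boost the owner so that no mid-priority task can preempt it.
        self.owner.priority = max(self.owner.priority, task.priority)
        return False

    def release(self, task):
        """Drop any inherited priority and hand the mutex to the top waiter."""
        if task is not self.owner:
            raise RuntimeError("only the owner may release the mutex")
        task.priority = task.base_priority
        if self.waiters:
            # The highest-priority waiter becomes the owner, so it needs
            # no boost from the waiters that remain.
            nxt = max(self.waiters, key=lambda t: t.priority)
            self.waiters.remove(nxt)
            self.owner = nxt
        else:
            self.owner = None


met = Task("meteorological", 1)
comm = Task("communications", 2)
bus = Task("bus management", 3)
lock = InheritanceMutex()

got_lock = lock.acquire(met)      # meteorological task takes the mutex
bus_got_lock = lock.acquire(bus)  # bus task blocks; owner is boosted
boosted_priority = met.priority

# The scheduler now chooses between the mutex holder and the
# communications task; the bus task is blocked and not a candidate.
next_to_run = max([met, comm], key=lambda t: t.priority)

lock.release(met)                 # mutex passes to the bus task
```

With the boost in place, the meteorological task outranks the
communications task for as long as it holds the mutex, so it can finish
and release the lock instead of being preempted.
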
Most of this information comes from various
accounts of a talk given by David Wilner, Chief
Technical Officer of Wind River Systems. Wind
River's VxWorks was the RTOS which ran on the
Pathfinder computer.
http://catless.ncl.ac.uk/Risks/19.49.html