Software Failures

“A learning experience is one of those things that says, ‘You know that thing you just did? Don’t do that.’ “
The Salmon of Doubt Douglas Adams

In this module we will discuss software failure, highlight some examples, and discuss how to view your own code when it fails.


Contents


How bad is it, doc?

How prevalent are software failures? According to the 2020 CHAOS Report, only 31% of software projects are successful! That number sounds awful! How do any of us have jobs still?

First, it should also be noted that the CHAOS Report has seen significant criticism on methology and accuracy, so it’s numbers shouldn’t be taken as gospel. Second, this depends on your definition of success and failure. The CHAOS report historically defines successful as a project being delivered on time, on budget, with all expected features. Failing any single one of those criteria, let alone multiple, causes the project to no longer be considered successful. A project that delivers working software but doesn’t meet all three of these criteria is considered “challenged.” And regardless skepticism of the CHAOS report itself, software is frequently late, over-budget, or missing key features.


Examples of software failures

What follows is short list of select software failures. This is not even close to an exhaustive list (such a list would make War and Peace look like a short afternoon pleasure read), but should capture the damage some failures can cause:

  • 1985-1987 - Therac 25, a radiation therapy machine, was an “improvement” over Therac 20, because it replaced hardware safeguards to prevent incorrect usage with software safeguards, which were seen as more reliable at the time. A “race conditions” bug in the operating system of the radiation therapy machine resulted in people receiving direct radition from a high intensity laser. The laser was supposed to hit a metal plate which produced X-Rays, with only the X-Rays reaching the patient. However, if the operator typed quickly, it could result in the metal plate not correctly being positioned in the path of the laser. This happened several times, and at least 5 patients died.

  • 1999 - The Mars Climate Orbiter is lost in the Martian atmosphere (contrary to popular retelling, the Mars Climate Orbiter almost certainly didn’t “crash into the planet”, but rather burned up in the atmosphere, or skipped off the atmosphere and out of the Mars gravitational sphere of influence). The primary cause of the failure was ground software produced by Lockheed Martin. The software system to calculate thrust was specified to produce SI units (newton-seconds). However, the software instead calculated thrust in pound-seconds (meaning the calculation result was incorrect by a factor of nearly 4.5). While NASA didn’t write the software, they also failed to detect the error, and ultimately claim responsibility. The project cost over $325 Million.

  • 2014 - The United Kingdom’s NATS (National Air Traffic Control Services) Swanick control center had a full-system failure. Multiple airports (including Heathrow, the busiest airport in Europe) had to cancel flights, close runways, and divert planes. The bug was because of an incorrect capacity number of Atomic Functions, written in code as 151, rather than the correct number of 193. The bug had likely existed since the 90s, but never manifested.

  • 2021 Facebook developers take down their website running a command to assess the capacity of their “global backbone”. The global backbone is tens of thousands of miles of fiber optic cables and routers connecting their data centers. The command is flawed, and while there was a safeguard system is in place to prevent bad commands causing failures, there was a bug in the safeguard. All of Facebook’s data centers were disconnected, and as a result the Facebook DNS server withdrew their routes (meaning the internet didn’t know where to send you when you typed “facebook.com”). Facebook’s internal communications (emails, directories, Zoom services) all were cut-off. Reportedly, physical access to Facebook facilities was also cut-off, as the security badges didn’t work without the server to authenticate them.


Why does software fail?

As computer hardware has grown cheaper, easier to distribute, and able to utilize more sensors, like cameras, scanners, etc., software has become more and more common. Large scale software is an engineering feat just like building a large bridge.

The Golden Gate Bridge, picture taken by Courtney Hill

In engineering, a key aspect of safety is redundancy. The Golden Gate Bridge’s main cables consist of over 27 thousand galvanized wire cables. If any single one of those cables somehow fails, the overall bridge is still very safe.

Does that redundancy apply to source code? Well pick some source code you have written, and randomly delete one letter somewhere in your program. Chances are, your code doesn’t run anymore. However, unlike the Golden Gate Bridge’s cable, replacing that character to fix the problem is virtually free (hit the undo button).

While software is very brittle, it’s also very correctable. What we need are the tools, techniques, and systems in place to help us prevent, detect, and fix errors before they cause harm.


Learning From Failure

“Though you should not fear failure, you should do your very best to avoid it.” - Conan O’Brien commencement speech at Dartmouth in 2011

Part of learning is failing. Everyone will fall off their bike and skin their knee while learning to ride. Everyone plays awful sounding noises when they are first learning an instrument. In this same way, we need to learn how to build software accepting that our first outing won’t win any Turing Awards. However, the earlier we find failures in the software process, the better.

As a final takeaway, it is not a software failure when a defect is found by a developer. Software fails when the defect isn’t found. A test failing is a success, because it means you found one more defect, which is the first step to fixing it.

“The Guide says there is an art to flying”, said Ford, “or rather a knack. The knack lies in learning how to throw yourself at the ground and miss.”
Life, The Universe, and Everything, Douglas Adams


Previous submodule:
Next submodule: