
Why Your Software is Never Perfect


We occasionally have students ask for help with software: “My software is perfect, but it doesn’t work!” Your software is never perfect. My own software is never perfect.

I recently found that software I wrote 37 years ago made someone’s top ten list. It’s not a top ten list anyone would aspire to be on. Software that I wrote in 1979 is number two on a list of ten historical software bugs with extreme consequences. I learned some very important lessons from that experience.

Background

My first job out of college was to document what everyone thought was a very complete and very well-tested set of software for processing low-level data from the solar backscatter ultraviolet / total ozone mapping spectrometer (SBUV/TOMS) instruments on NASA’s Nimbus 7 satellite. The entire development team had already moved on to other projects with different employers. They left behind a large set of code and a very tall stack of computer printouts that contained their test results.

I started from the state of “What is this ‘FORTRAN’ language?” but quickly proceeded to “How can this code possibly work?” and from there to “There’s no way this code can possibly work!” Only on reaching that final stage of understanding did I look at that massive stack of test results. Aside from the developers who had abandoned ship, I was the first to do so. Nobody else had looked at those test results; they had instead looked at the amazing thickness of the printouts.

Testing by thickness always has been and always will be a phenomenally bad idea. Some of those test printouts were slim; those were failed compilations. The rest were what was then called “ABEND dumps.” In those days, the equivalent of what is now called a segmentation fault and core dump resulted in the entire virtual memory of the process in question being printed out in hexadecimal, a huge waste of paper. Not one test indicated success.

This turned out to be a career-maker for me. I made a name for myself by fixing that mess. As a result, I was subsequently given the privilege of working directly for the principal investigator of that pair of instruments and his team of scientists. Instead of the low-level computer science stuff involved with my first task, my next task truly did involve scientific programming.

Why the Nimbus 7 satellite did not discover the ozone hole

Of the two ozone measuring instruments on the Nimbus 7 satellite, one (SBUV) had been flown previously, but the more precise instrument (TOMS) was brand new. The previously flown instrument sometimes yielded flaky results when the solar angle was low, and the team scientists were worried that the same would apply to the newer instrument. They did not want their good scientific names sullied by suspect data. As a result, they vehemently insisted that I filter out those suspect results by resetting any data point with a low solar incidence angle and an estimated ozone quantity outside a predetermined range to a value that meant “missing or invalid data”.

I argued that if I did what the principal investigator asked, there would be no way to discover anomalies. “We can change your code if we discover anomalies,” was his reply. I suggested that I produce two products: an unfiltered one for NASA internal use only and a filtered version for release to the wider research community. He did not want any part of that either, on the basis that the unfiltered version would somehow get outside of NASA. “Do what I told you to do, or we will tell your employer to assign someone else to us.” I capitulated and did what he told me to do.

Karma!

The Nimbus 7 satellite did not discover the ozone hole. Credit for that discovery instead goes to Joseph Farman, who simply pointed a device invented in the 1920s (a Dobson spectrophotometer) up into the sky. Mr. Farman received a very nice obituary in the New York Times. The SBUV/TOMS team will more or less die anonymously. That’s karma.

Because I was not more insistent with the scientific team, I too was stricken by karma. In 1986, curious minds at NASA wanted very much to know why their very expensive satellite had not discovered what a person using a 1920s-era device had discovered. The scientific team discovered that my name was all over the code that hid the ozone hole from NASA. (They conveniently forgot why that was the case.) This made people high up in NASA want to talk to me, personally. Despite having switched employers three times and despite having moved 1400+ miles away from that initial job, I received numerous phone calls and even a few visits from people very high up in NASA that year. I told them why that code existed, and also how to fix it. Voila! After reprocessing the archived satellite data, the Antarctic springtime ozone hole appeared every year.

What I learned

  • Lesson number one: Take responsibility for your code.
    Version control software provides the ability to establish blame (or credit) for who wrote/modified each and every line of code in the codebase. Your name is the sole name attached to the code you write, not your boss’s name, nor that of your customer. You never know who’s going to come back to you seven years or more after the fact regarding the code that you wrote. It’s your name that will be on the code, so take full responsibility for it. While I took full responsibility for fixing that very bad code right out of college, I did not take full responsibility for the code I wrote immediately afterward. I should have.
  • Lesson number two: Your code is never perfect.
    As I noted at the outset, this site occasionally receives posts that start with “My code is perfect, but it doesn’t work right! Help me!” If your code doesn’t work right, it is not perfect, by definition. Typical code has one bug per hundred lines. Well-crafted, well-inspected, and well-tested code typically has one bug per thousand lines, or perhaps one bug per ten thousand lines if done very carefully. Pushing beyond one bug per a few thousand lines of code is very hard and very expensive. The Space Shuttle flight software reportedly had fewer than one bug per two hundred thousand lines of code. That incredibly low error rate was achieved at the cost of writing code at the glacial rate of two lines per person per week, after taking into account the hours people spent writing and reviewing requirements, writing and reviewing test code, reviewing the test results, and attending meeting after boring meeting. Even with all that, the Space Shuttle flight software was not perfect. It was, however, as close to perfection as code can possibly be. (Note: I did not participate in that process. It would have killed me.)
  • Lesson number three: Even if your code is perfect, it is not perfect.
    This is the difference between verification and validation. Verification asks whether the code does exactly what the requirements or user stories say it should do. If the tests are incomplete (and tests always are incomplete), a hidden bug is just waiting to manifest itself. While verification is hard, validation is harder yet. Validation asks whether the requirements/user stories are themselves correct, and that is not something that typically can be automated. In the case of Nimbus 7, there was a faulty requirement to filter out suspect data. Because I initially balked at writing the code to implement it, there was an explicit test, written by me and reviewed by others, that ensured that the code filtered out those suspect values. Faulty requirements result not only in faulty code but also in faulty tests that prove that the code behaves faultily, just as required. A sketch of that trap follows this list.
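
To make lesson three concrete, below is a minimal sketch in C (the actual Nimbus 7 code was FORTRAN, and the function name, sentinel value, and thresholds here are all invented for illustration) of how a faulty requirement produces both faulty code and a faulty test that passes:

    #include <assert.h>

    #define MISSING_DATA -9999.0  /* hypothetical "missing or invalid data" sentinel */

    /* Hypothetical filter implementing the faulty requirement: reset
     * out-of-range ozone readings taken at low solar incidence angles
     * to the missing-data sentinel. */
    double filter_ozone(double ozone_du, double solar_angle_deg)
    {
        const double LOW_SUN_DEG  = 6.0;    /* invented threshold */
        const double MIN_OZONE_DU = 180.0;  /* invented "valid" range */
        const double MAX_OZONE_DU = 650.0;

        if (solar_angle_deg < LOW_SUN_DEG &&
            (ozone_du < MIN_OZONE_DU || ozone_du > MAX_OZONE_DU))
            return MISSING_DATA;
        return ozone_du;
    }

    int main(void)
    {
        /* Verification: the code does exactly what the requirement says. */
        assert(filter_ozone(120.0, 3.0) == MISSING_DATA); /* anomaly filtered out */
        assert(filter_ozone(300.0, 3.0) == 300.0);        /* in-range value kept */
        return 0;
    }

Both asserts pass and the code is fully verified, yet the one question that matters, whether anomalously low readings should ever be discarded at all, is one no such test can answer.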


52 replies
  1. rootone says:

    All compilers have options to insert code which can help with the detection of anomalies.
    This does result in longer run times, but modern CPUs don't really notice it.
    If you have your source code polished down to being near 100% reliable, there might be a small advantage in turning off those debugging options.
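
    For instance (a sketch assuming gcc or clang; their sanitizer options are one modern example of the kind of inserted checks described above):

        /* Built with runtime checks enabled,
         *
         *     cc -g -fsanitize=address,undefined demo.c && ./a.out
         *
         * the off-by-one write below aborts with a diagnostic at run time.
         * Built without those flags, the program may appear to work. */
        #include <stdio.h>

        int main(void)
        {
            int readings[10];
            for (int i = 0; i <= 10; i++)   /* off-by-one: writes readings[10] */
                readings[i] = i;
            printf("%d\n", readings[5]);
            return 0;
        }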

  2. eachus says:
    Sherwood Botsford wrote: "Hmm. Pascal was my first language. Years ago when I was using Turbo Pascal, you had the option for range checking. By default it was on. The compiler inserted checks on every array and pointer operation so that the program couldn't access data not originally assigned to that variable. Wonder how much that slows down the software. My estimate is only a few percent."

    Negative. Not the idea, but the estimated impact of enabling global checking. One of the first things I learned when we were developing an Ada compiler at Honeywell was how to find every place the compiler generated code to raise errors. Some of those places I examined intending to fix the compiler, because it had missed an inference. Some? Oops! Blush, my error. And a few really belonged there.
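
    For readers who never used Turbo Pascal, here is a hand-written C analogue of the range check such a compiler inserts before every array access (a sketch; the names and bounds are invented):

        #include <stdio.h>
        #include <stdlib.h>

        /* Hand-written analogue of a compiler-inserted range check:
         * fail loudly instead of silently touching out-of-bounds memory. */
        static double checked_get(const double *a, size_t len, size_t i)
        {
            if (i >= len) {  /* the check Pascal would have inserted */
                fprintf(stderr, "range check failed: index %zu, bound %zu\n", i, len);
                abort();
            }
            return a[i];
        }

        int main(void)
        {
            double data[100] = {0};
            printf("%g\n", checked_get(data, 100, 42));   /* passes the check */
            printf("%g\n", checked_get(data, 100, 100));  /* aborts, like a Pascal range error */
            return 0;
        }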

    Today you can program in SPARK, which is technically a strict subset of Ada plus a verifier and other tools. It is extra work, but worth it for software you need to trust. Sometimes you need that quality, or better. I remember one program where the "Red Team" I was on was dinged because, ten years after the software was fielded, no decision had been made about whether to hire the contractor to maintain the code or to use an organic (AF) facility. I just shook my head. There were still no reported bugs, and I don't think there ever will be. Why? If you remember the movie War Games, the project was writing the code for the WOPR. Technically the computer did not have firing authority. It just decoded launch control messages presented to the officers in the silos (well, in between a group of five silos). The software could also change targeting by reprogramming the missiles directly. We very much wanted the software to be perfect when fielded, and not changed for any reason without months (or longer) of revalidation.

    Let me also bring up two disasters that show the limits of what can be accomplished. In a financial maneuver, Arianespace let a contract to upgrade the flight control software for Ariane 4. I'm hazy on why this was needed, but to save money on Ariane 5, Arianespace decided they were going to reuse all of the flight control and engine management hardware and software from Ariane 4 on Ariane 5. The Ariane 4 FCS was subcontracted by the German firm doing the hardware part of the upgrade to an English company. The software developers, aware that the software would also be used on Ariane 5, asked to see the high-level design documents for Ariane 5. Fearing that this would give the English an advantage in bidding for Ariane 5 work, the (French) management at Arianespace refused.

    Since the guidance hardware and software were already developed, the test plan was a "full-up" test in which the engines and gyroscopes would be mounted, as in the Ariane 5, on a two-degree-of-freedom platform that would allow a simulated flight of Ariane 5's first stage from launch to burnout. It ran way over budget and behind schedule, and it became the "long pole in the tent": the last box on a PERT or Gantt chart. Rather than wait another year on a project already behind schedule, they went ahead with the launch.

    If you have any experience with software for systems like this, you know that there are about a dozen "constants" that are only constant for a given version of that rocket, jet engine, airplane, etc.: things like gross takeoff weight, moments of inertia, not-to-exceed engine deflections, and so on. Since the software hadn't been touched, the Ariane 5 launched with Ariane 4 parameters. One difference was that Ariane 5 heads east a lot earlier in the mission. Another was that Ariane 4 had a program to update the inertial guidance system which needed to run for 40 seconds after launch. On Ariane 5 it could have been shut off at t=0; Ariane 4 could be aborted and rapidly "turned around" until t=6, a capability that was actually used, but one that made no sense on Ariane 5, which was airborne immediately after t=0. On the first Ariane 5 launch the clock reached t=39, the Ariane 5 was "impossibly" far away from the launch site, and the unnecessary software (which was only needed, remember, on the pad) aborted both flight guidance computers. The engines deflected too far, the stack broke up, and a lot of VIPs got showered with hardware. (Fortunately no one was killed.)

    What does this story tell you? That software you write, perfect for its intended function, can cause half a billion euros of damage if it is used for an unintended purpose. We used to say that software sharing needs to be bottom up, not top down, because an M2 tank is not an M1 tank. Now we point to Ariane instead.

    The second case is even more of a horror, and not just because it killed several hundred people. The A320 was programmed by the French in a language that they developed to allow formal proofs of correctness. The prototype crashed at an airshow; fortunately, most of the people on board survived. The pilot had been operating way outside the intended operating envelope. He made a low, slow pass, then went to pull up to clear some trees. The plane wanted to land, but there wasn't enough runway, and by the time the plane and pilot agreed on aborting the landing and going around, it was too late. The engines couldn't spool up fast enough. Unfortunately, the (French) investigators didn't look into why the engines hadn't been spooled up already.

    There were several more A320 crashes during landings and very near the runway. Lots of guesses, no clear answers. Then there was a flight into Strasbourg; the pilots had flown the route many times before, but this time they were approaching from the north due to winds. The plane flew through a pass, then dived into the ground. The French "probable cause" claims pilot error in setting the descent rate into the autopilot. The real problem seems to have been that the pilots set a waypoint in the autopilot at a beacon in the pass. The glide path for their intended runway, if extended back to the beacon, was underground. The autopilot would try to put the aircraft onto the middle of the glide path as soon as possible after the last waypoint. That sounds perfectly sensible in English, in French, and in the special programming language. But of course, that requirement was deadly.

    The other crashes were not as clear. Yes, a waypoint above a rise in the ground, and an altitude for clearing the waypoint that put the plane above the center of the glide path. Pilots argue that the A320 could be recovered after flipping its tail up. The software has been fixed, and the French say the problem was a dual-use control (degrees of descent or hundreds of feet per minute) whose setting pilots could misread or ignore. But the real lesson is that it doesn't matter how fancy your tools are if they can conceal fatal flaws in the programmer's thinking. (And conceal them from those who review the software as well.)

