
Consequences of bad numerical computing

  1. May 16, 2009 #1


    Staff Emeritus
    Science Advisor

    Some disasters attributable to bad numerical computing


    That's why software QC/QA are critical - both in the code/method and in the implementation/application.

    And not to forget: Garbage in = Garbage out
  3. May 17, 2009 #2


    Science Advisor
    Homework Helper

    Wow! from the Patriot report:
    … and 28 soldiers had to die to find that out. :redface:
  4. May 17, 2009 #3

    D H

    Staff Emeritus
    Science Advisor

    All three disasters show that verification and validation do not suffice. The "poor" handling of rounding errors would not have been a problem had the system been reset on a regular basis. The guidance, navigation, and control algorithms used in Ariane 5 Flight 501 worked just fine in the Ariane 4 rockets from which they were taken. The NASTRAN software used to design the Sleipner A offshore platform had been vetted and re-vetted.

    What all three lacked are certification and accreditation. Was the Patriot battery certified to operate for 100 hours, continuously? (No) Was the Ariane 4 software accredited for use on the Ariane 5? (No) Was the use of NASTRAN, including the data, certified? (No).

    While verification and validation are incredibly important parts of the overall QA process, they are not enough. Just because the pieces have been proven to work correctly on their own (unit testing) and in conjunction (integrated testing) does not guarantee they will work properly when applied beyond their original bounds or when fed garbage. Nothing works well when it is fed garbage.
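    The Patriot arithmetic behind that mishap is easy to reproduce. A minimal sketch, assuming the commonly cited figures (1/10-second ticks chopped to 23 fractional bits, 100 hours of continuous uptime); the register layout is taken from public accounts of the incident, not from the actual source code:

```python
from fractions import Fraction

# 1/10 has a non-terminating binary expansion, so the Patriot's fixed-point
# clock stored it chopped to a fixed number of fractional bits.
BITS = 23                                        # per the usual account
tenth = Fraction(1, 10)
stored = Fraction(int(tenth * 2**BITS), 2**BITS) # truncated representation

per_tick_error = tenth - stored                  # time lost on every tick
ticks = 100 * 3600 * 10                          # 100 hours of 0.1 s ticks
drift = per_tick_error * ticks

print(float(per_tick_error))   # ~9.54e-08 s per tick
print(float(drift))            # ~0.343 s after 100 hours
```

    A third of a second is over half a kilometer of travel for an incoming Scud, which is why the battery failed to track it. Resetting the system would have zeroed the accumulated drift, exactly as D H says.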
  5. May 17, 2009 #4


    Science Advisor


    A little more seriously, what is the difference between
    -verification and validation?
    -certification and accreditation?

    If all four are followed, is that sufficient QC/QA? Is there some standard reference for the QC/QA process that engineers follow?
  6. May 17, 2009 #5

    D H

    Staff Emeritus
    Science Advisor

    In a nutshell,
    • Validation: Are these the right requirements for this model? For this system?

    • Verification: Does the initial design / detailed design / initial prototype / as-built system correctly implement the requirements?

    • Accreditation: Is this the right model for this system? Just because you have a model that worked perfectly elsewhere (e.g., the Ariane 4 GNC system) does not mean it will work well when applied outside its intended domain. Somebody needs to take responsibility for this reuse and accredit the model for the new intended use.

    • Certification: Has someone (sometimes legally) certified that the system as a whole and in all of its parts works as planned, up to the operational limits of the system, and beyond? The last is important, and is the hardest to test: How does the system respond when things go wrong?
  7. May 17, 2009 #6
    I think what it shows us is that compilers are lacking some important analysis tools. Of the three events listed, one was due to rounding error and another to overflow, both of which could be detected by code analysis that tracks the effective precision of variables.
  8. May 17, 2009 #7

    D H

    Staff Emeritus
    Science Advisor

    Compilers are not going to solve the problem of roundoff errors or overflow. Any system that uses floating point numbers is subject to roundoff errors, overflow, and underflow. The GNC software that caused the failure of Ariane 5 flight 501 worked just fine in the Ariane 4. The roundoff error that resulted in the Patriot mishap would not have been a problem if the system had been reset before the error could build up to a sizable amount.

    Compilers are an important but tiny part of the overall solution. How does a compiler know how the module it is compiling is to be used? How is a compiler going to detect faulty requirements, a bad design, incorrect but valid inputs, bad specs, erroneous operating manuals, false sales promises? Thinking that the problems can be automated away is wishful thinking. Wishful thinking is one of the key sources of errors in complex systems.
  9. May 18, 2009 #8
    Tools don't fix problems; people fix problems, using tools. But if you have no tool to detect a particular type of problem, the chances of a person fixing it are much lower. There is no tool I'm aware of that lets you track the effective precision of a variable, and if one exists, it's not in common usage. If such a tool existed and had been used to check these two programs, the bugs would have been detected. It would also be extremely useful when profiling numerical procedures to increase overall precision.
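    For what it's worth, the closest existing technique to "tracking effective precision" is interval arithmetic, which lives in libraries rather than compilers. A toy sketch of the idea (the class is illustrative, not any real flight tool; `math.nextafter` requires Python 3.9+):

```python
import math

class Interval:
    """Toy interval type: carries guaranteed lower/upper bounds through addition."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Round outward so the true result always stays enclosed.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def width(self):
        return self.hi - self.lo

# The real number 1/10 lies strictly between two adjacent doubles:
tick = Interval(math.nextafter(0.1, 0.0), 0.1)

acc = Interval(0.0, 0.0)
for _ in range(36000):        # one hour of 0.1 s ticks
    acc = acc + tick

# acc.width() is now a certified bound on the accumulated rounding error.
print(acc.lo, acc.hi, acc.width())
```

    A Patriot-style analysis run through such a type would have shown the error bound growing without limit as uptime increased, which is precisely the signal the posters above wish a tool had produced.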
  10. May 18, 2009 #9
    The Ariane 5 problem-- the "overflow error" referred to-- would have been caught by modern tools, and it wouldn't have made any difference. In that case what happened was that a double-precision float was converted to a 16-bit integer where a 16-bit integer was insufficient. Modern compilers tend to recognize that this operation (converting a double-precision float to an integer) is a dangerous (precision-losing) operation and require an explicit cast to verify programmer intent. In current versions of GCC, this is a compile-time warning-- I think it is required to be a warning by C99 or some other ISO standard. In Microsoft's C# it is actually illegal, a compile-time error, to convert float to int without an explicit cast.

    However, given what D H tells us, this wouldn't have stopped the problem. The reason is that the float->int conversion appears to have been intentional: a correct, spec'd-out operation for the Ariane 4 rocket. So if they had been using compiler tools that recognized truncation back when the Ariane 4 code was written (and maybe they were?!), the compiler would have emitted a warning, and the programmer would simply have cast to silence it. And with the cast embedded in the code base, no warning would have appeared when the code was adapted for Ariane 5...
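    The failure mode itself is easy to sketch. Python has no 16-bit integer type, so the silent wrap of an unguarded narrowing conversion is simulated here by masking; the numeric values are illustrative, not the actual Ariane horizontal-bias readings:

```python
def to_int16_unchecked(x):
    """Keep only the low 16 bits, as an unguarded float->int16 conversion would."""
    v = int(x) & 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

def to_int16_checked(x):
    """A guard of the shape flight software would use before narrowing."""
    v = int(x)
    if not -32768 <= v <= 32767:
        raise OverflowError(f"{v} does not fit in 16 bits")
    return v

print(to_int16_unchecked(20000.0))   # 20000  -- fine within an Ariane-4-sized envelope
print(to_int16_unchecked(40000.0))   # -25536 -- silent wrap at larger magnitudes
```

    The cast (or mask) makes the program well-formed in the compiler's eyes either way; only the checked variant turns an out-of-range reading into a detectable event rather than a garbage value.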
  11. May 18, 2009 #10

    D H

    Staff Emeritus
    Science Advisor

    How do you know they didn't use an explicit cast?

    Flight software is typically chock-full of tests to ensure that the underlying math routines never raise an exception. Take a square root? Better check the number is positive first, and not too close to zero (square root of a tiny number results in underflow). Divide? Better check for lots of problems. Convert a floating point number to fixed point? Better check for overflow.
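    In that defensive style, the guards look roughly like this. A sketch only: the threshold and the fallback behavior are illustrative choices, not real flight values:

```python
import math

EPS = 1e-12   # illustrative floor, not an actual flight constant

def safe_sqrt(x):
    # Reject negatives, and tiny positives whose root could underflow downstream.
    if x < EPS:
        return 0.0            # one possible fallback; real FSW might flag a fault
    return math.sqrt(x)

def safe_divide(num, den):
    # Guard against division by zero or by a dangerously small denominator.
    if abs(den) < EPS:
        raise ZeroDivisionError("denominator below safe threshold")
    return num / den
```

    Every such guard costs cycles on every frame, which is exactly why the questions below, what to do on failure and whether each check is worth it, have to be answered case by case.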

    Two problems arise: (1) What to do when one of those checks fails? (2) Are all of those checks really necessary? These are done in software, not firmware, so they are very expensive. My reading is that the Ariane 4 code was compiled with options that turned off some of that run-time checking in the FSW.

    Even if they had compiled the software with those checks enabled, the system would still have failed. Reusing the Ariane 4 code was a bad design decision. The Ariane 5 had enough thrust to make the readings converted to 16 bit integer go offscale high. The only difference in enabling the checks is the nature of the FDIR response. The machine throwing a floating point error made for a very quick response. I suspect a software check would have made exactly the same response, but not quite as quick. (Impending overflow: strike one; skip frame. Impending overflow: strike two, skip frame. Impending overflow: strike three, you're out, computer is suspect, reboot).
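    That hypothetical three-strikes response could be sketched as follows. This is purely an illustration of the pattern D H is speculating about, not the actual Ariane FDIR logic:

```python
class FrameMonitor:
    """Sketch of a skip-frame / reboot escalation on repeated overflow risk."""
    def __init__(self, strikes_allowed=3):
        self.strikes = 0
        self.strikes_allowed = strikes_allowed

    def report_overflow_risk(self):
        # Each impending overflow is a strike; on the last one, the
        # computer is declared suspect and restarted.
        self.strikes += 1
        if self.strikes >= self.strikes_allowed:
            return "reboot"
        return "skip_frame"

m = FrameMonitor()
print(m.report_overflow_risk())   # skip_frame
print(m.report_overflow_risk())   # skip_frame
print(m.report_overflow_risk())   # reboot
```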

    The problem was not the use of 16 bit integers nor was it the conversion of a double to a short. The problem was the scale factor used in the conversion, and that was because they decided to reuse the Ariane 4 FSW as-is.

    It was already verified and validated, after all.
  12. May 18, 2009 #11
    They very probably did. The point I was trying to make is that it wouldn't have mattered either way, because this is looking for the problem in the wrong place (for the reasons you describe).