Consequences of bad numerical computing


Discussion Overview

The discussion revolves around the consequences of poor numerical computing practices, particularly in the context of high-stakes systems such as military and aerospace applications. Participants explore various incidents attributed to numerical errors, emphasizing the importance of quality control and the limitations of existing verification and validation processes.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Conceptual clarification

Main Points Raised

  • Some participants highlight specific disasters caused by numerical computing errors, such as the Patriot missile system failure, attributing them to inadequate software quality control and the principle of "garbage in, garbage out."
  • Others discuss the differences between verification, validation, certification, and accreditation, suggesting that all four processes are necessary but may not be sufficient for ensuring quality control.
  • There is contention regarding the role of compilers in detecting numerical errors, with some arguing that compilers lack important analysis tools for tracking rounding errors and overflow, while others assert that compilers cannot fully address these issues because such failures often stem from requirements, design, and reuse decisions rather than from the code itself.
  • Participants debate whether the reuse of software from the Ariane 4 in the Ariane 5 was a sound decision, with some suggesting that the original software was not adequately accredited for its new application.
  • Concerns are raised about the effectiveness of runtime checks in flight software, questioning their necessity and the potential for performance trade-offs.

Areas of Agreement / Disagreement

Participants express a range of views on the effectiveness of current quality control measures and the role of compilers in preventing numerical errors. There is no consensus on whether existing practices are sufficient or how to best address the issues raised.

Contextual Notes

Limitations in the discussion include unresolved questions about the adequacy of verification and validation processes, the dependence on specific definitions of terms like certification and accreditation, and the challenges in tracking effective precision in numerical computations.

Astronuc
Some disasters attributable to bad numerical computing

http://www.ima.umn.edu/~arnold/disasters/disasters.html


That's why software QC/QA are critical - both in the code/method and in the implementation/application.


And not to forget: Garbage in = Garbage out
 
Wow! from the Patriot report:
Specifically, the time in tenths of second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24 bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds.

… and 28 soldiers had to die to find that out. :redface:
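For anyone who wants to reproduce that 0.34-second figure, here is a quick back-of-envelope check in Python. One caveat: published write-ups describe the chop as 23 or 24 bits depending on how the leading zero bits of 1/10 are counted; 23 fractional bits is the assumption that reproduces the quoted number.

```python
# Back-of-envelope check of the Patriot clock drift described above.
# Assumption: the 24-bit register effectively kept 23 bits after the
# radix point for the constant 1/10 (per the linked analysis).
BITS = 23
chopped_tenth = int(0.1 * 2**BITS) / 2**BITS   # 1/10 truncated, not rounded
chop_error = 0.1 - chopped_tenth               # ~9.5e-8 per tick

ticks = 100 * 3600 * 10                        # 100 hours of tenth-second ticks
drift_seconds = ticks * chop_error             # accumulated timing error
print(f"per-tick error: {chop_error:.3g}")
print(f"drift after 100 h: {drift_seconds:.2f} s")   # ~0.34 s
```

At Mach 5, 0.34 seconds is roughly half a kilometer of target motion, which is why the tracking gate missed.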
 
All three disasters show that verification and validation do not suffice. The "poor" handling of rounding errors would not have been a problem if the system had been reset on a regular basis. The guidance, navigation, and control algorithms used in Ariane 5 Flight 501 worked just fine in the Ariane 4 rockets from which they were taken. The NASTRAN software used to design the Sleipner A offshore platform had been vetted and re-vetted.

What all three lacked were certification and accreditation. Was the Patriot battery certified to operate for 100 hours, continuously? (No) Was the Ariane 4 software accredited for use on the Ariane 5? (No) Was the use of NASTRAN, including the data, certified? (No)

While verification and validation are incredibly important parts of the overall QA process, they are not enough. Just because the pieces have been proven to work correctly on their own (unit testing) and in conjunction (integrated testing) does not guarantee they will work properly when applied beyond their original bounds or when fed garbage. Nothing works well when it is fed garbage.
 
D H said:
Nothing works well when it is fed garbage.

http://en.wikipedia.org/wiki/Oscar_the_Grouch

A little more seriously, what is the difference between
-verification and validation?
-certification and accreditation?

If all 4 are followed is that sufficient QC/QA? Is there some standard reference for QC/QA process that engineers follow?
 
atyy said:
A little more seriously, what is the difference between
-verification and validation?
-certification and accreditation?

In a nutshell,
  • Validation: Are these the right requirements for this model? For this system?
  • Verification: Does the initial design / detailed design / initial prototype / as-built system correctly implement the requirements?
  • Accreditation: Is this the right model for this system? Just because you have a model that worked perfectly elsewhere (e.g., the Ariane 4 GNC system) does not mean it will work well when applied outside its intended domain. Somebody needs to take responsibility for this reuse and accredit the model for the new intended use.
  • Certification: Has someone (sometimes legally) certified that the system as a whole and in all of its parts works as planned, up to the operational limits of the system, and beyond? The last is important, and is the hardest to test: How does the system respond when things go wrong?
 
I think what it shows us is that compilers are lacking some important analysis tools. Of the three events listed, one was due to rounding errors and another to overflow, both of which could be detected by code analysis that tracks the effective precision of variables.
 
Compilers are not going to solve the problem of roundoff errors or overflow. Any system that uses floating point numbers is subject to roundoff errors, overflow, and underflow. The GNC software that caused the failure of Ariane 5 flight 501 worked just fine in the Ariane 4. The roundoff error that resulted in the Patriot mishap would not have been a problem if the system had been reset before the error could build up to a sizable amount.

Compilers are an important but tiny part of the overall solution. How does a compiler know how the module it is compiling is to be used? How is a compiler going to detect faulty requirements, a bad design, incorrect but valid inputs, bad specs, erroneous operating manuals, false sales promises? Thinking that the problems can be automated away is wishful thinking. Wishful thinking is one of the key sources of errors in complex systems.
 
D H said:
Compilers are not going to solve the problem of roundoff errors or overflow. Any system that uses floating point numbers is subject to roundoff errors, overflow, and underflow. The GNC software that caused the failure of Ariane 5 flight 501 worked just fine in the Ariane 4. The roundoff error that resulted in the Patriot mishap would not have been a problem if the system had been reset before the error could build up to a sizable amount.

Tools don't fix the problem; people fix the problem, using tools. But if you have no tools to detect a particular type of problem, the chances of a person fixing it are much lower. There does not exist a tool that I'm aware of which allows you to track the effective precision of a variable, and if there is, it's not in common usage. If such a tool existed and had been used to check these 2 programs, the bugs would have been detected. It would also be extremely useful when profiling numerical procedures to increase overall precision.
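To make the "track the effective precision of a variable" idea concrete, here is a toy sketch of the bookkeeping such a tool would do. This is plain interval-style error propagation written for illustration, not the API of any real tool:

```python
# Toy illustration of tracking effective precision: carry an absolute
# error bound alongside each value and propagate it through arithmetic.
class Tracked:
    def __init__(self, value, err=0.0):
        self.value = value      # computed value
        self.err = err          # bound on accumulated absolute error

    def __mul__(self, other):
        # |ab - a'b'| <= |a|*eb + |b|*ea + ea*eb
        v = self.value * other.value
        e = (abs(self.value) * other.err
             + abs(other.value) * self.err
             + self.err * other.err)
        return Tracked(v, e)

# Patriot-style accumulation: 3,600,000 tenth-second ticks times a
# constant known only to ~9.5e-8 absolute error.
ticks = Tracked(3_600_000)
tenth = Tracked(0.1, err=9.5e-8)
t = ticks * tenth
print(f"time = {t.value:.1f} s +/- {t.err:.2f} s")
```

A static analyzer doing this over a whole program would have flagged the clock variable's error bound growing without limit, which is exactly the class of bug being discussed.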
 
The Ariane 5 problem-- the "overflow error" referred to above-- would have been caught by modern tools, yet catching it would not have made any difference. In that case what happened was that a double-precision float was converted to a 16-bit integer where a 16-bit integer was insufficient. Modern compilers tend to recognize that converting a double-precision float to an integer is a dangerous (precision-losing) operation and require an explicit cast to verify programmer intent. In current versions of GCC this is a compile-time warning-- I think it is required to be a warning by C99 or some other ISO standard. In Microsoft's C# it is actually illegal, a compile-time error, to convert float to int without an explicit cast.

However, given what D H tells us, this wouldn't have stopped the problem. The reason is that the float->int conversion appears to have been intentional, a correct and spec'd-out operation for the Ariane 4 rocket. So if compile tools that recognized truncation had been in use when the Ariane 4 software was written (and maybe they were?!), the compiler would have emitted a warning, and the programmer would simply have added a cast to silence it. And with that cast embedded in the code base, no warning would have appeared when the code was adapted for Ariane 5...
 
How do you know they didn't use an explicit cast?

Flight software is typically chock-full of tests to ensure that the underlying math routines never raise an exception. Take a square root? Better check the number is positive first, and not too close to zero (square root of a tiny number results in underflow). Divide? Better check for lots of problems. Convert a floating point number to fixed point? Better check for overflow.
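As a concrete sketch of the last check mentioned above, here is what such a range-guarded conversion looks like. This is illustrative Python, not the actual flight code (which was Ada), and the readings are invented numbers:

```python
# Sketch of a guarded double -> 16-bit-integer conversion, the kind of
# run-time check being discussed. Values are invented for illustration.
INT16_MAX, INT16_MIN = 32767, -32768

def to_int16(x: float) -> int:
    """Convert with an explicit range check instead of letting the
    hardware raise an operand-error exception on overflow."""
    if not (INT16_MIN <= x <= INT16_MAX):
        raise OverflowError(f"{x} does not fit in 16 bits")
    return int(x)

print(to_int16(28000.0))      # Ariane 4-scale reading: fits
try:
    to_int16(64000.0)         # Ariane 5-scale reading: offscale high
except OverflowError as e:
    print("check tripped:", e)
```

Note that the check only changes *how* the failure surfaces; whether the response to a tripped check is sensible is a separate design question, which is the point made below.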

Two problems arise: (1) What to do when one of those checks fails? (2) Are all of those checks really necessary? These checks are done in software, not firmware, so they are very expensive. My reading is that the Ariane 4 code was compiled with options that turned off some of that run-time checking in the FSW.

Even if they had compiled the software with those checks enabled, the system would still have failed. Reusing the Ariane 4 code was a bad design decision. The Ariane 5 had enough thrust to make the readings converted to 16 bit integer go offscale high. The only difference in enabling the checks is the nature of the FDIR response. The machine throwing a floating point error made for a very quick response. I suspect a software check would have made exactly the same response, but not quite as quick. (Impending overflow: strike one; skip frame. Impending overflow: strike two, skip frame. Impending overflow: strike three, you're out, computer is suspect, reboot).

The problem was not the use of 16 bit integers nor was it the conversion of a double to a short. The problem was the scale factor used in the conversion, and that was because they decided to reuse the Ariane 4 FSW as-is.

It was already verified and validated, after all.
 
D H said:
How do you know they didn't use an explicit cast?
They very probably did. The point I was trying to make is that it wouldn't have mattered either way, because this is looking for the problem in the wrong place (for the reasons you describe).
 
