Consequences of bad numerical computing


Discussion Overview

The discussion revolves around the consequences of poor numerical computing practices, particularly in the context of high-stakes systems such as military and aerospace applications. Participants explore various incidents attributed to numerical errors, emphasizing the importance of quality control and the limitations of existing verification and validation processes.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Conceptual clarification

Main Points Raised

  • Some participants highlight specific disasters caused by numerical computing errors, such as the Patriot missile system failure, attributing them to inadequate software quality control and the principle of "garbage in, garbage out."
  • Others discuss the differences between verification, validation, certification, and accreditation, suggesting that all four processes are necessary but may not be sufficient for ensuring quality control.
  • There is contention regarding the role of compilers in detecting numerical errors, with some arguing that compilers lack important analysis tools for tracking rounding errors and overflow, while others assert that compilers cannot fully address these issues because such failures often stem from requirements, design, and reuse decisions rather than from the code itself.
  • Participants debate whether the reuse of software from the Ariane 4 in the Ariane 5 was a sound decision, with some suggesting that the original software was not adequately accredited for its new application.
  • Concerns are raised about the effectiveness of runtime checks in flight software, questioning their necessity and the potential for performance trade-offs.

Areas of Agreement / Disagreement

Participants express a range of views on the effectiveness of current quality control measures and the role of compilers in preventing numerical errors. There is no consensus on whether existing practices are sufficient or how to best address the issues raised.

Contextual Notes

Limitations in the discussion include unresolved questions about the adequacy of verification and validation processes, the dependence on specific definitions of terms like certification and accreditation, and the challenges in tracking effective precision in numerical computations.

Astronuc
Some disasters attributable to bad numerical computing

http://www.ima.umn.edu/~arnold/disasters/disasters.html


That's why software QC/QA are critical - both in the code/method and in the implementation/application.


And not to forget: Garbage in = Garbage out
 
Wow! from the Patriot report:
Specifically, the time in tenths of second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24 bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds.

… and 28 soldiers had to die to find that out. :redface:
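For anyone who wants to reproduce that 0.34-second figure, here is a quick back-of-envelope check in Python. One caveat: published write-ups describe the chop as 23 or 24 bits depending on how the leading zero bits of 1/10 are counted; 23 fractional bits is the assumption that reproduces the quoted number.

```python
# Back-of-envelope check of the Patriot clock drift described above.
# Assumption: the 24-bit register effectively kept 23 bits after the
# radix point for the constant 1/10 (per the linked analysis).
BITS = 23
chopped_tenth = int(0.1 * 2**BITS) / 2**BITS   # 1/10 truncated, not rounded
chop_error = 0.1 - chopped_tenth               # ~9.5e-8 per tick

ticks = 100 * 3600 * 10                        # 100 hours of tenth-second ticks
drift_seconds = ticks * chop_error             # accumulated timing error
print(f"per-tick error: {chop_error:.3g}")
print(f"drift after 100 h: {drift_seconds:.2f} s")   # ~0.34 s
```

At Mach 5, 0.34 seconds is roughly half a kilometer of target motion, which is why the tracking gate missed.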
 
All three disasters show that verification and validation do not suffice. The "poor" handling of rounding errors would not have been a problem if the system had been reset on a regular basis. The guidance, navigation, and control algorithms used in Ariane 5 Flight 501 worked just fine in the Ariane 4 rockets from which they were taken. The NASTRAN software used to design the Sleipner A offshore platform had been vetted and re-vetted.

What all three lacked were certification and accreditation. Was the Patriot battery certified to operate for 100 hours, continuously? (No) Was the Ariane 4 software accredited for use on the Ariane 5? (No) Was the use of NASTRAN, including the data, certified? (No)

While verification and validation are incredibly important parts of the overall QA process, they are not enough. Just because the pieces have been proven to work correctly on their own (unit testing) and in conjunction (integrated testing) does not guarantee they will work properly when applied beyond their original bounds or when fed garbage. Nothing works well when it is fed garbage.
 
D H said:
Nothing works well when it is fed garbage.

http://en.wikipedia.org/wiki/Oscar_the_Grouch

A little more seriously, what is the difference between
-verification and validation?
-certification and accreditation?

If all 4 are followed is that sufficient QC/QA? Is there some standard reference for QC/QA process that engineers follow?
 
atyy said:
A little more seriously, what is the difference between
-verification and validation?
-certification and accreditation?

In a nutshell,
  • Validation: Are these the right requirements for this model? For this system?
  • Verification: Does the initial design / detailed design / initial prototype / as-built system correctly implement the requirements?
  • Accreditation: Is this the right model for this system? Just because you have a model that worked perfectly elsewhere (e.g., the Ariane 4 GNC system) does not mean it will work well when applied outside its intended domain. Somebody needs to take responsibility for this reuse and accredit the model for the new intended use.
  • Certification: Has someone (sometimes legally) certified that the system as a whole and in all of its parts works as planned, up to the operational limits of the system, and beyond? The last is important, and is the hardest to test: How does the system respond when things go wrong?
 
I think what it shows us is that compilers are lacking some important analysis tools. Of the three events listed, one was due to rounding errors and another to overflow, both of which could be detected by code analysis that tracks the effective precision of variables.
 
Compilers are not going to solve the problem of roundoff errors or overflow. Any system that uses floating point numbers is subject to roundoff errors, overflow, and underflow. The GNC software that caused the failure of Ariane 5 flight 501 worked just fine in the Ariane 4. The roundoff error that resulted in the Patriot mishap would not have been a problem if the system had been reset before the error could build up to a sizable amount.

Compilers are an important but tiny part of the overall solution. How does a compiler know how the module it is compiling is to be used? How is a compiler going to detect faulty requirements, a bad design, incorrect but valid inputs, bad specs, erroneous operating manuals, false sales promises? Thinking that the problems can be automated away is wishful thinking. Wishful thinking is one of the key sources of errors in complex systems.
 
D H said:
Compilers are not going to solve the problem of roundoff errors or overflow. Any system that uses floating point numbers is subject to roundoff errors, overflow, and underflow. The GNC software that caused the failure of Ariane 5 flight 501 worked just fine in the Ariane 4. The roundoff error that resulted in the Patriot mishap would not have been a problem if the system had been reset before the error could build up to a sizable amount.

Tools don't fix the problem; people fix the problem, using tools. But if you have no tools to detect a particular type of problem, the chances of a person fixing it are much lower. There does not exist a tool that I'm aware of which allows you to track the effective precision of a variable, and if there is, it's not in common usage. If such a tool existed and had been used to check these 2 programs, the bugs would have been detected. It would also be extremely useful when profiling numerical procedures to increase overall precision.
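To make the "track the effective precision of a variable" idea concrete, here is a toy sketch of the bookkeeping such a tool would do. This is plain interval-style error propagation written for illustration, not the API of any real tool:

```python
# Toy illustration of tracking effective precision: carry an absolute
# error bound alongside each value and propagate it through arithmetic.
class Tracked:
    def __init__(self, value, err=0.0):
        self.value = value      # computed value
        self.err = err          # bound on accumulated absolute error

    def __mul__(self, other):
        # |ab - a'b'| <= |a|*eb + |b|*ea + ea*eb
        v = self.value * other.value
        e = (abs(self.value) * other.err
             + abs(other.value) * self.err
             + self.err * other.err)
        return Tracked(v, e)

# Patriot-style accumulation: 3,600,000 tenth-second ticks times a
# constant known only to ~9.5e-8 absolute error.
ticks = Tracked(3_600_000)
tenth = Tracked(0.1, err=9.5e-8)
t = ticks * tenth
print(f"time = {t.value:.1f} s +/- {t.err:.2f} s")
```

A static analyzer doing this over a whole program would have flagged the clock variable's error bound growing without limit, which is exactly the class of bug being discussed.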
 
The Ariane 5 problem-- the "overflow error" referred to above-- would have been caught by modern tools, yet catching it would not have made any difference. In that case what happened was that a double-precision float was converted to a 16-bit integer where a 16-bit integer was insufficient. Modern compilers tend to recognize that converting a double-precision float to an integer is a dangerous (precision-losing) operation and require an explicit cast to verify programmer intent. In current versions of GCC this is a compile-time warning-- I think it is required to be a warning by C99 or some other ISO standard. In Microsoft's C# it is actually illegal, a compile-time error, to convert float to int without an explicit cast.

However, given what D H tells us, this wouldn't have stopped the problem. The reason is that the float->int conversion appears to have been intentional, a correct and spec'd-out operation for the Ariane 4 rocket. So if compile tools that recognized truncation had been in use when the Ariane 4 software was written (and maybe they were?!), the compiler would have emitted a warning, and the programmer would simply have added a cast to silence it. And with that cast embedded in the code base, no warning would have appeared when the code was adapted for Ariane 5...
 
How do you know they didn't use an explicit cast?

Flight software is typically chock-full of tests to ensure that the underlying math routines never raise an exception. Take a square root? Better check the number is positive first, and not too close to zero (square root of a tiny number results in underflow). Divide? Better check for lots of problems. Convert a floating point number to fixed point? Better check for overflow.
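As a concrete sketch of the last check mentioned above, here is what such a range-guarded conversion looks like. This is illustrative Python, not the actual flight code (which was Ada), and the readings are invented numbers:

```python
# Sketch of a guarded double -> 16-bit-integer conversion, the kind of
# run-time check being discussed. Values are invented for illustration.
INT16_MAX, INT16_MIN = 32767, -32768

def to_int16(x: float) -> int:
    """Convert with an explicit range check instead of letting the
    hardware raise an operand-error exception on overflow."""
    if not (INT16_MIN <= x <= INT16_MAX):
        raise OverflowError(f"{x} does not fit in 16 bits")
    return int(x)

print(to_int16(28000.0))      # Ariane 4-scale reading: fits
try:
    to_int16(64000.0)         # Ariane 5-scale reading: offscale high
except OverflowError as e:
    print("check tripped:", e)
```

Note that the check only changes *how* the failure surfaces; whether the response to a tripped check is sensible is a separate design question, which is the point made below.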

Two problems arise: (1) What to do when one of those checks fails? (2) Are all of those checks really necessary? These checks are done in software, not firmware, so they are very expensive. My reading is that the Ariane 4 code was compiled with options that turned off some of that run-time checking in the FSW.

Even if they had compiled the software with those checks enabled, the system would still have failed. Reusing the Ariane 4 code was a bad design decision. The Ariane 5 had enough thrust to make the readings converted to 16 bit integer go offscale high. The only difference in enabling the checks is the nature of the FDIR response. The machine throwing a floating point error made for a very quick response. I suspect a software check would have made exactly the same response, but not quite as quick. (Impending overflow: strike one; skip frame. Impending overflow: strike two, skip frame. Impending overflow: strike three, you're out, computer is suspect, reboot).

The problem was not the use of 16 bit integers nor was it the conversion of a double to a short. The problem was the scale factor used in the conversion, and that was because they decided to reuse the Ariane 4 FSW as-is.

It was already verified and validated, after all.
 
D H said:
How do you know they didn't use an explicit cast?
They very probably did. The point I was trying to make is that it wouldn't have mattered either way, because this is looking for the problem in the wrong place (for the reasons you describe).
 
