Numerical floating point addition

Click For Summary
SUMMARY

When summing a sorted array of floating-point numbers, adding them from smallest to largest in magnitude yields a more accurate result due to reduced truncation error. The discussion emphasizes the importance of order in floating-point addition, particularly when dealing with large ranges of magnitudes. An advanced technique is proposed, utilizing an array of 2048 doubles to hold intermediate sums, indexed by the exponent of the double precision number, to further minimize truncation errors during the addition process.

PREREQUISITES
  • Understanding of floating-point arithmetic
  • Familiarity with C programming language
  • Knowledge of numerical stability concepts
  • Experience with data structures, specifically arrays
NEXT STEPS
  • Research techniques for minimizing truncation error in numerical computations
  • Learn about the IEEE 754 standard for floating-point arithmetic
  • Explore the implementation of summation algorithms in C
  • Investigate the use of Kahan summation algorithm for improved accuracy
USEFUL FOR

Mathematicians, software developers, and engineers involved in numerical analysis, particularly those working with floating-point computations and seeking to enhance the accuracy of their algorithms.

Khashishi
Science Advisor
Messages
2,812
Reaction score
491
If I have a bunch of sorted numbers spanning a large range of magnitudes, is it better to add them up from smallest to largest or from largest to smallest, or something else?

Let's say I'm summing an array A, which is sorted from large to small. Which gives a more accurate result:

sum1 = 0
for i=0,length(A)-1
sum1 += A
end

sum2 = 0
for i=length(A)-1,0
sum2 += A
end
 
Physics news on Phys.org
Adding the numbers in order, smallest to largest (in magnitude) is better (less truncation error).

It's also possible to further reduce truncation error by using a function and an array of 2048 doubles to hold intermediate sums where the index into the array is based on the exponent of a double precision number ( in C the index = (* (unsigned __int64 *)(&number)) >> 52) & 0x7ff; ). The array is initialized to zero, and each time a new number is to be added, the index for that number is generated. If array[index] == 0. , then the number is just stored, else number = number + array[index]; array[index] = 0.; a new index for number is generated and the process repeated until array[index] == 0 and the number is stored (or until overflow is detected). Once all numbers have been added, then the array is summed from index = 0 to index = 0x7ff to produce the sum. The purpose of this is minimize truncation.
 
Last edited:
if they're floating point, that's a no-brainer.

start with zero and add the small ones (in magnitude, negative or positive) first.
 

Similar threads

  • · Replies 4 ·
Replies
4
Views
2K
Replies
9
Views
2K
  • · Replies 4 ·
Replies
4
Views
3K
Replies
4
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 4 ·
Replies
4
Views
7K
Replies
12
Views
2K
  • · Replies 3 ·
Replies
3
Views
2K
  • · Replies 7 ·
Replies
7
Views
16K
  • · Replies 2 ·
Replies
2
Views
2K