Numerical floating point addition

Adding sorted numbers from smallest to largest in magnitude minimizes truncation error in floating point addition. Summing in the opposite order can lead to less accurate results due to the way floating point arithmetic handles precision. To further reduce truncation errors, a method using an array of 2048 doubles can be employed to hold intermediate sums based on the exponent of the numbers. This technique involves storing and combining numbers efficiently to avoid overflow and maintain accuracy. Overall, for optimal results in floating point addition, start with the smallest magnitudes first.
Khashishi
Science Advisor
If I have a bunch of sorted numbers spanning a large range of magnitudes, is it better to add them up from smallest to largest or from largest to smallest, or something else?

Let's say I'm summing an array A, which is sorted from large to small. Which gives a more accurate result:

sum1 = 0
for i = 0, length(A)-1
    sum1 += A[i]
end

sum2 = 0
for i = length(A)-1, 0, -1
    sum2 += A[i]
end
 
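Adding the numbers in order, smallest to largest (in magnitude), is better (less truncation error). When the running sum is much larger than the next number, the low-order bits of that number are lost; adding the small numbers first lets them build up to a size that still registers when the large numbers are added.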
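The effect is easy to see with one large value followed by many tiny ones. The following is a minimal Python sketch, not part of the original post: the helper name sum_in_order and the test data are made up for illustration, and math.fsum from the standard library is used only as an accurately rounded reference.

```python
import math

def sum_in_order(values):
    """Plain left-to-right floating point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

# Hypothetical test data: sorted from large to small, as in the question.
# Each 1e-16 is less than half the spacing between doubles near 1.0.
A = [1.0] + [1e-16] * 10_000

large_first = sum_in_order(A)            # like sum1: largest magnitude first
small_first = sum_in_order(reversed(A))  # like sum2: smallest magnitude first
reference = math.fsum(A)                 # accurately rounded sum

print("large first:", repr(large_first))
print("small first:", repr(small_first))
print("reference:  ", repr(reference))
print("error, large first:", abs(large_first - reference))
print("error, small first:", abs(small_first - reference))
```

Adding largest first, every 1e-16 is rounded away and the total stays at exactly 1.0. Adding smallest first, the tiny terms accumulate to about 1e-12 before they meet the 1.0, so their contribution survives.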

It's also possible to further reduce truncation error by using a function and an array of 2048 doubles to hold intermediate sums, where the index into the array is the exponent field of a double precision number (in C, index = ((*(unsigned __int64 *)(&number)) >> 52) & 0x7ff;). The array is initialized to zero, and each time a new number is to be added, the index for that number is generated. If array[index] == 0., the number is just stored. Otherwise number = number + array[index]; array[index] = 0.; a new index for number is generated, and the process is repeated until array[index] == 0. and the number is stored (or until overflow is detected). Once all numbers have been added, the array is summed from index = 0 to index = 0x7ff to produce the sum. The purpose of this is to minimize truncation: during accumulation a number is only ever added to another number with the same exponent, so few low-order bits are lost.
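The scheme above can be sketched in Python as follows. This is one possible reading of the description, not the poster's actual code: the names exponent_index and ExponentBucketSum are invented here, the exponent field is extracted with the struct module instead of a C pointer cast, and an exponent field of 0x7ff (infinity or NaN) is treated as the overflow case.

```python
import struct

def exponent_index(x):
    """Return the 11-bit biased exponent field of a double (0..0x7ff)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits >> 52) & 0x7FF

class ExponentBucketSum:
    """Hypothetical accumulator holding at most one partial sum per exponent."""

    def __init__(self):
        self.slots = [0.0] * 2048  # one slot per exponent value, all zero

    def add(self, number):
        while True:
            index = exponent_index(number)
            if index == 0x7FF:
                # Assumption: infinity or NaN is reported as overflow.
                raise OverflowError("sum overflowed or input was not finite")
            if self.slots[index] == 0.0:
                self.slots[index] = number  # empty slot: just store
                return
            # Occupied slot: combine, empty the slot, and retry with the
            # new exponent of the combined value.
            number += self.slots[index]
            self.slots[index] = 0.0

    def total(self):
        # Sum from the smallest exponent to the largest.
        result = 0.0
        for value in self.slots:
            result += value
        return result

# Hypothetical usage with one large value and many tiny ones.
acc = ExponentBucketSum()
for v in [1.0] + [1e-16] * 10_000:
    acc.add(v)
bucket_total = acc.total()
print(repr(bucket_total))
```

Each pass through the loop in add empties one occupied slot, so it always terminates. The result is still subject to rounding in the final pass over the slots, so this reduces the error rather than eliminating it.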
 
If they're floating point, that's a no-brainer.

Start with zero and add the small ones (in magnitude, negative or positive) first.
 