The smallest floating point x such that x+2=x

  • Thread starter: phil.st
  • Tags: Floating Point

Discussion Overview

The discussion centers on finding the smallest floating point number x such that x + 2 = x. Participants explore various programming approaches, particularly in C and MATLAB, and discuss the implications of floating point representation in computing.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant presents a C program intended to find the smallest x but acknowledges it only finds a power of two within an interval containing x.
  • Another suggests approaching the problem from a bit-level perspective, discussing the bit-patterns for floating point numbers.
  • Several participants critique the initial program, arguing that the while loop should divide x by 2 instead of multiplying it, as the goal is to find a smaller number.
  • Discussion includes specific values for single and double precision floating point representations, with one participant calculating that for single precision, x = 2^25 and for double precision, x = 2^54.
  • Confusion arises regarding the output of MATLAB, with one participant questioning why the smallest x is equal to 1.801439850948198e+016, leading to discussions about precision settings in MATLAB and compiler optimizations in C.
  • Some participants express uncertainty about the correct approach and the implications of floating point arithmetic in their calculations.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the best method to find the smallest floating point x, with multiple competing views on programming approaches and interpretations of floating point representation.

Contextual Notes

There are limitations regarding assumptions about floating point precision, the impact of compiler optimizations, and the specific programming environments used by participants. The discussion reflects varying levels of understanding of floating point arithmetic.

Who May Find This Useful

This discussion may be useful for programmers and researchers interested in numerical methods, floating point arithmetic, and the intricacies of computer representation of real numbers.

phil.st
Hi! I want to calculate the smallest floating point x such that x+2=x. I've written a programme in C that does not actually determine the smallest floating point x such that x+2=x, but rather determines a power of two within an interval that contains that x. Does anyone have any ideas on how I can improve this solution or find a new one? Is there any built-in function in MATLAB and/or C to get an accurate approximation of that x?

Code:
#include <stdio.h>
 
 int main( int argc, char **argv )
 {
    float x = 1.0f;
 
    printf( "current x \t 2 + current x\n" );
    do {
       printf( "%G\t\t %.20f\n", x, (2.0f + x) );
       x *= 2.0f;
    }
    while ((float)(2.0 + x) != x);
 
    printf( "\nCalculated x: %G\n", x );
    return 0;
 }

Code:
Calculated x: 3.68935E+019
 
Might I suggest you approach this from the bit level? You may however not be a bit-head and won't like this idea. But if you're interested, for example, on a 32-bit machine, say floating point numbers are double-wide so 64 bits, what is the smallest and next largest number centered at 2 in 64-bit floating-point format that can be coded? What are the bit-values for the exponent and mantissa for these numbers? There has to be a definite bit-pattern for this.
 
phil.st said:
Hi! I want to calculate the smallest floating point x such that x+2=x. I've written a programme in C that does not actually determine the smallest floating point x such that x+2=x, but rather determines a power of two within an interval that contains that x. Does anyone have any ideas on how I can improve this solution or find a new one?
Your "solution" is not a solution at all, so there is lots of room for improvement. The value you calculated, 3.68935E+019, is NOT a small number.

In your code, the while loop repeatedly multiplies your starting value by 2. You should be dividing it by 2 in each loop iteration.
 
jackmell said:
Might I suggest you approach this from the bit level? You may however not be a bit-head and won't like this idea. But if you're interested, for example, on a 32-bit machine, say floating point numbers are double-wide so 64 bits, what is the smallest and next largest number centered at 2 in 64-bit floating-point format that can be coded? What are the bit-values for the exponent and mantissa for these numbers? There has to be a definite bit-pattern for this.

Thank you for your answer. I like your approach but I'm a bit confused. Could you be more specific? In 64-bit floating-point format, sign bit = 1 bit, exponent width = 11 bits and mantissa = 52 bit.
So, I suppose that: (2)_{10} =(0.1 \underbrace{000...0}_\text{51})_2 \times 2^2
What's next?

Mark44 said:
Your "solution" is not a solution at all, so there is lots of room for improvement. The value you calculated, 3.68935E+019, is NOT a small number.

In your code, the while loop repeatedly multiplies your starting value by 2. You should be dividing it by 2 in each loop iteration.

I'm searching for the smallest floating point x so that x+2=x. Obviously, x won't be a small number. If I divide it by 2 in each iteration, the programme loops endlessly.
 
Single precision floating point numbers have a sign bit, an 8 bit exponent, and a 24 bit significand where the upper bit is assumed to be 1, so only the lower 23 bits are stored in the number.

http://en.wikipedia.org/wiki/Single_precision_floating-point_format

For single precision, (x+2) (truncated) = x when x = 2^25 = 33554432, which in single precision format will be encoded as ((25+127) << 23) = hex 4c000000. Trying to add 2 to this number will result in the 2 being shifted off the right end of the significand during the addition, assuming the compiler doesn't include some rounding-up function internally.

For double precision, the significand has 53 bits (52 stored), so you'd want
2^54 = 18014398509481984 which will be stored as ((54+1023)<< 52) = hex 4350000000000000

http://en.wikipedia.org/wiki/Double_precision_floating-point_format
 
Each time you enter the test loop, instead of multiplying the old x value by 2, why not divide by 2?
 
Mark44 said:
Your "solution" is not a solution at all, so there is lots of room for improvement. The value you calculated, 3.68935E+019, is NOT a small number.

In your code, the while loop repeatedly multiplies your starting value by 2. You should be dividing it by 2 in each loop iteration.
He wants a largish number. He is looking for the smallest x such that x+2 == x. You are thinking of the largest x such that x+2 == 2.

phil.st:
That number is a bit big; it looks like a double rather than a float. I get 33554432 rather than 3.68935E+019.
 
@rcgldr: MATLAB gives me the following results:
Code:
>> format long e
>> x = 2^54

x =

    1.801439850948198e+016

>> tf = isequal(x,x+2)

tf =

     1

>>

You're right! However, I don't understand why the smallest x is equal to 1.801439850948198e+016.

@D H: To be honest, I'm totally confused at this point.
 
What machine, what compiler are you using?
 
  • #10
D H said:
What machine, what compiler are you using?

1) Intel i3 CPU M 350 @ 2.27GHz, 3.00 GB RAM, Windows 7 x64
2) Dev-C++ 5.0 beta 9.2 (4.9.9.2) with Mingw/GCC, MATLAB
 
  • #11
phil.st said:
@rcgldr: MATLAB gives me the following results:
...
You're right! However, I don't understand why the smallest x is equal to 1.801439850948198e+016.
Because that is exactly what you should get for a double.
You haven't told MATLAB to use single precision, so it is using the default double precision.

You are probably compiling optimized, and your compiler is (erroneously) eliminating the cast to float, instead using the internal 80 bit doubles for the comparison. Try compiling unoptimized, and also tell the compiler to stop using floating point registers so much:
Code:
#include <stdio.h>
 
 int main( int argc, char **argv )
 {
    float x = 1.0f;
    volatile float xp2 = x + 2.0f;
 
    printf( "current x \t 2 + current x\n" );
    do {
       printf( "%.1f\t\t %.1f\n", x, xp2 );
       x *= 2.0f;
       xp2 = x + 2.0f;
    }
    while (xp2 != x);
 
    printf( "\nCalculated x: %.1f\n", x );
    return 0;
 }
 
  • #12
phil.st said:
Thank you for your answer. I like your approach but I'm a bit confused. Could you be more specific? In 64-bit floating-point format, sign bit = 1 bit, exponent width = 11 bits and mantissa = 52 bit.
So, I suppose that: (2)_{10} =(0.1 \underbrace{000...0}_\text{51})_2 \times 2^2
What's next?

Ok, sorry I mis-read your post. I thought you were looking for x+2=2 but I think the bit-analysis I suggested would still work for x+2=x which I think the others in here are doing.
 
  • #13
jackmell said:
Ok, sorry I mis-read your post. I thought you were looking for x+2=2 ...
As did I.
 
  • #14
D H said:
Because that is exactly what you should get for a double.
You haven't told MATLAB to use single precision, so it is using the default double precision.

You are probably compiling optimized, and your compiler is (erroneously) eliminating the cast to float, instead using the internal 80 bit doubles for the comparison. Try compiling unoptimized, and also tell the compiler to stop using floating point registers so much:
...

Thanks man! I appreciate your help. Everything is ok now!
 
  • #15
D H said:
You are probably compiling optimized, and your compiler is (erroneously) eliminating the cast to float, instead using the internal 80 bit doubles

Update: D H is correct. I forgot that the original value for the C program (not the MATLAB program) corresponds to 2^65, which works for x+2 == x in Intel 80 bit extended precision, which has a 64 bit significand:

http://en.wikipedia.org/wiki/Extended_precision
 
  • #16
rcgldr said:
D H said:
You are probably compiling optimized, and your compiler is (erroneously) eliminating the cast to float, instead using the internal 80 bit doubles for the comparison.
That's unlikely, since 2^54 works for x+2 == x, which works for a 64 bit double, but not for an internal 80 bit value. Note that IEEE floats and doubles can usually be compared directly as if they were integers (except for issues like NaNs).
I stand by what I said. Look at the stated value in the original post:
phil.st said:
Code:
Calculated x: 3.68935E+019

2^54 is 1.8014398509481984×10^16, considerably smaller than the stated value.

2^65 is 3.6893488147419103232×10^19, which is the stated value. The Intel 80 bit extended precision format has a 64 bit significand and stores the leading significand bit explicitly, unlike the IEEE single and double formats, which leave it implicit.
 
  • #17
D H said:
Look at the stated value in the original post:
2^65 is 3.6893488147419103232×10^19, which is the stated value. The Intel 80 bit extended precision format has a 64 bit significand and stores the leading significand bit explicitly, unlike the IEEE single and double formats, which leave it implicit.
You're correct, I forgot about the value in the original post. I updated my previous post. Wiki article for Intel 80 bit extended precision format:

http://en.wikipedia.org/wiki/Extended_precision

rcgldr said:
Note that IEEE floats and doubles can usually be compared directly as if they were integers (except for issues like NaNs).
Already quoted but removed from previous post. This is only true for positive IEEE floating point numbers with less than the maximum exponent value (which is used to represent infinity or NaN). Negative IEEE floating point numbers are stored as sign and magnitude, with only the sign bit set (as opposed to using the equivalent of a two's complement format), so you'd have to reverse the sense of a compare. A compiler is going to use a floating point compare to compare floating point numbers, so sorry for this bit of off topic trivia. (Must remember to not post when I'm tired.)

D H said:
You are probably compiling optimized, and your compiler is (erroneously) eliminating the cast to float, instead using the internal 80 bit doubles for the comparison.
I tested the program with various versions of Microsoft compilers, and it didn't matter whether the optimizer was enabled (/Ox) or disabled (/Od): 16 bit Visual C++ 2.2 used Intel 80 bit extended precision and stopped at 2^65, while 32 bit Visual C++ 4.0 and Visual Studio 2005 used 64 bit double precision and stopped at 2^54. Forcing the compiler to do 32 bit integer (long) compares via pointers and casting solved the problem even with the optimizer enabled:

Code:
#include <stdio.h>
 
int main(int argc, char **argv )
{
    float x = 1.0f;
    float y;
 
    do {
       x *= 2.0f;
       y = x + 2.0f;
    }
    while (*(long *)(&y) != *(long *)(&x));
 
    printf( "\nCalculated x: %21.0f\n", x );
    return 0;
}

This while statement would work with optimizer disabled, but failed with optimizer enabled:

Code:
    while (y != x);
 
