What is the difference between double and int64_t in C/C++ for precision?

  • Context: C/C++
  • Thread starter: ORF
  • Tags: Precision

Discussion Overview

The discussion centers around the differences between the double and int64_t data types in C/C++, particularly regarding precision in numerical representation. Participants explore the implications of using double for numbers formatted with a specific precision (6 digits before the decimal point and 8 digits after) and whether int64_t would be a better choice for maintaining accuracy.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant questions whether the double type is sufficient for a specific numerical format (6f8) or if int64_t should be used instead.
  • Another participant clarifies that double is an IEEE754 floating point type and notes that conversion in C++ can lead to truncation, which may affect precision.
  • A participant suggests that rounding instead of truncating could help maintain precision when using double.
  • Concerns are raised about mixing types in calculations, with a suggestion to cast types explicitly to avoid ambiguity.
  • Discussion includes the presence and specifications of long double in various compilers, with some participants noting that its implementation can vary significantly.
  • One participant mentions that the C standards allow for flexibility in the implementation of floating point types, which could lead to unexpected behavior in different environments.

Areas of Agreement / Disagreement

Participants express differing views on the sufficiency of double for the specified precision, the handling of type conversions, and the implementation of long double across compilers. There is no consensus on the best approach or the implications of using different data types.

Contextual Notes

Participants highlight limitations related to type conversions, the potential for precision loss in floating point arithmetic, and the variability in long double implementations across different compilers. These factors contribute to the complexity of the discussion without resolving the underlying issues.

ORF
Hello

I am currently using double type for numbers with the format 6f8 (6 digits before the point, 8 digits after the point).

Is the double structure enough for this format, or should I use int64_t instead?

Thank you in advance :)

Greetings.
PS: I tried this example code, but the result is a bit strange to me... (is precision lost for a 4-digit double?)
Code:
#include <iostream>
#include <stdint.h>

int main()
{
  std::cout << "check precision of Double vs uint64_t\n";
  uint64_t k(1);
  while( k < 1e14 )
  {
      k=k*3+1;
      //-- check precision of myDouble
      double myDouble( k/1e8);
      //-- converting back to an integer truncates the fractional part
      uint64_t test( 1e8 * myDouble );
      if( k != test ) std::cout << k << "\t" << test << std::endl;
  }
  return 0;
}
http://cpp.sh/8zqfx
The output I got is:
check precision of Double vs uint64_t
3280 3279
3812798742493 3812798742492
 
double is not a structure. It is an IEEE 754 floating-point type. Converting a double to an integer in C++ truncates, so if your actual number comes out to be 3812798742492.999, you will still get 3812798742492 when it is converted to uint64_t. Try printing out the double value too and see what happens.

BoB
 
Hello

You were right: without conversion (truncation) it seems that the number is the same
Code:
#include <iostream>
#include <stdint.h>
#include <iomanip>
int main()
{
  std::cout << "check precision of Double vs uint64_t\n";
  uint64_t k(1);
  while( k < 1e14 )
  {
      k=k*3+1;
      //-- check precision of myDouble
      double myDouble( k/1e8);
      uint64_t test( 1e8 * myDouble );
      //-- print the double itself instead of the truncated integer
      if( k != test ) std::cout << std::setprecision(16) << k << "\t" << myDouble*1e8 << std::endl;
  }
  return 0;
}

So, does this mean that no digit is lost with the 6f8 format when using doubles? Where is the limit? (When k > 1e15, the last digits are lost...)

Thank you for your time :)

Greetings
 
IEEE 754 double precision has a 52-bit stored mantissa (53 significant bits with the implicit leading bit), good enough for 15 decimal digits (log10(2^52) = 15.65...), so 14 digits (6f8) shouldn't be an issue if you round instead of truncate.

Code:
#include <iostream>
#include <stdint.h>   // provides uint64_t; a hand-written typedef can clash with it

int main()
{
    std::cout << "check precision of Double vs uint64_t\n";
    uint64_t k(1);
    while( k < 1e14 )
    {
        k=k*3+1;
        //-- check precision of myDouble
        double myDouble = k/1e8;    // 1e8 is a double
        uint64_t test = (uint64_t)( 1e8 * myDouble + 0.5);  // + 0.5 for round
        if( k != test ) std::cout << k << "\t" << test << std::endl;
    }
    return 0;
}
 
Wait, your types are all wrong.

You can't just mix types like that. You have to cast them when you do any calculation, or else division or multiplication with an integer will result in another integer and THEN get cast to a double. Your code is ambiguous at best.

Also, have you checked to see if long double is present in your compiler? It's not standard (it might be after C++11), but most compilers support it as either a 96 or 128 bit double. Use sizeof(long double) to check; it'll give you the size in bytes.
 
newjerseyrunner said:
Wait, your types are all wrong.
Which post are you replying to?
newjerseyrunner said:
You can't just mix types like that. You have to cast them when you do any calculation, or else division or multiplication with an integer will result in another integer and THEN get cast to a double. Your code is ambiguous at best.

Also, have you checked to see if long double is present in your compiler? It's not standard (it might be after C++11), but most compilers support it as either a 96 or 128 bit double.
I don't think this is correct. Long ago, the Borland C/C++ compiler I had distinguished between double and long double as 64 bits and 80 bits, respectively. The Microsoft compiler I have now (VS 2015) supports both, but both are the same size - 64 bits.

Microsoft (and possibly others) have a __m128 data type that can be used with Streaming SIMD Extensions 2 (SSE2) intrinsics, but this is different from float, double, and long double.
newjerseyrunner said:
Use sizeof(long double) to check, it'll give you bytes.
 
newjerseyrunner said:
Also, have you checked to see if long double is present in your compiler? It's not standard (it might be after C++11), but most compilers support it as either a 96 or 128 bit double. Use sizeof(long double) to check; it'll give you the size in bytes.

Long double should be available in any C89 (or C90, C95, C99) compatible compiler. However technically long double could be equivalent to the float data type (it must not be worse than a double which must not be worse than a float). The C standards for floating point datatypes are very forgiving for doubles and long double so a compiler writer can implement a wide variety of solutions.

The range requirements for double/long double are, for instance, the same as for float (at least 1e-37 to 1e+37), and the precision requirements for double/long double are only specified as 10 decimal digits (versus 6 for float), which is in between the commonly used IEEE-754 binary32 (6 digits) and binary64 (15 digits) datatypes. I did some quick calculations and it looks like even 56 bits is actually overkill to satisfy the long double specification. So, usually float and double are IEEE-754 binary32 and binary64 respectively, but long double could be anything (but usually not worse than binary64).

This reminds me of an issue we had at work where some code failed because char turned out to be a 32 bit integer datatype on a DSP platform. It is an unusual but valid implementation ... the joys of the C standard.

If you want to marvel at it, the C99 standard can be found here: http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf and the IEEE-754-2008 standard can be found here: http://www.csee.umbc.edu/~tsimo1/CMSC455/IEEE-754-2008.pdf
 
newjerseyrunner said:
Wait, your types are all wrong.
1e8 is a double (at least with Visual Studio), so in the case of k / 1e8, k gets promoted to a double. I updated post #4 with a comment to note this.

long double
The older 16 bit Microsoft compilers use 80 bit floating point format for long doubles, but 32/64 bit Microsoft/Visual Studio compilers use 64 bit floating point for long doubles, the same as regular doubles.
 
