Why Are There Floating Point Errors When Parsing Data in Pandas?

In summary: You can also use the str() function to display a floating point number exactly.You can also use the str() function to display a floating point number exactly.
  • #1
Zap
406
120
TL;DR Summary
25.15 = 25.149999999999977??
I am parsing data from an excel sheet using pandas ExcelFile class and read_excel function. I noticed that when reading the file, there are floating point errors in the results. For example, a cell that contains 25.15 will be 25.149999999999977 in the pandas data frame.

The value 25.15 in Excel is not rounded. It is the result of adding 298.15 minus 273.

This link describes the problem, but there is not solution : https://github.com/pandas-dev/pandas/issues/21885

I'm surprised ExcelFile and read_excel has an error like this. It pretty much makes them unusable, in my opinion.
 
Technology news on Phys.org
  • #2
Do you need 13 digits of precision? And if you do, why are you using floats?
 
  • #3
Zap said:
It pretty much makes them unusable
Is that so ?
What accuracy is required by your application ?
 
  • #4
By the way, doesn't the fact that 1 + 1e-15 - 1 =0 in Excel make the whole spreadsheet program Excel unusable in the first place ?
a
1​
b
1.00E-15​
c = a+ b
1.000000000000000000​
c-1
0.000000000000000000​
 
  • #5
I need valid data. It's not about accuracy. I don't want to upload data that is clearly an error like that. It's a data quality issue, not a precision issue.

I've tried parsing them as strings, but the same issue arises. It's a string '25.149999999999977 ' instead of the expected '25.15'.

I'm wondering if there is a way to correct this floating point error without doing something like round(), because it will not be easy to tell the computer at which decimal place to round.
 
  • #6
BvU said:
By the way, doesn't the fact that 1 + 1e-15 - 1 =0 in Excel make the whole spreadsheet program Excel unusable in the first place ?
a
1​
b
1.00E-15​
c = a+ b
1.000000000000000000​
c-1
0.000000000000000000​
Context is key here. I'm not doing calculations. I'm uploading cited data that needs to have a certain level of integrity.
 
  • #7
Excel has about 16 digits of accuracy. What kind of data has its integrity damaged by that ? and, again:
BvU said:
What accuracy is required by your application ?
Zap said:
Context is key here.
I'm really curious about the context !

Because there are solutions for many applications...
 
  • #8
It's a floating point error. I don't want 16 digits.
 
  • #9
Yes No Yes No

You still haven't explained why
 
  • #10
I don't want to upload data into a data base that is clearly an error such as a floating point error, because data quality is of upmost importance for this database, and it will be available publicly. Would you want to access a website and see that floating point error as a cited experimental measurement?
 
  • #11
I am asking if you know of a way to correct for this error. If not, then thank you for your response.
 
  • #12
Zap said:
Would you want to access a website and see that floating point error as a cited experimental measurement?
A decent website doesn't present all digits. For empirical data 16 digits is more than enough

Zap said:
I am asking if you know of a way to correct for this error.
It's not an error.

##\ ##
 
  • Like
Likes anorlunda
  • #13
It's called a floating point error because you can't represent 25.15 exactly in binary.
 
  • #14
Okay. So, in Python, '{:n}'.format( x ), where x = 25.149999999999977 evaluates to '25.15,' which solves my problem.

Interestingly, if you do float( '{:n}'.format( x ) ), the result is 25.15, even though x = 25.149999999999977. Python has some built-in way of outputting floats exactly, even though behind the scenes the displayed value 25.15 may be stored as something like 25.1499999999999 ... I'm assuming SQL has something similar, but if I upload the data as 25.1499999999999, I will not get the clean display of 25.15. This was the original issue, and I have fixed it myself.

In a Pandas dataframe, you can apply the function lambda x : float( '{:n}'.format( float( x ) ) ) to remove floating point error.

Thanks,
Zap.
 
Last edited:
  • Like
Likes BvU
  • #15
Zap said:
It's called a floating point error because you can't represent 25.15 exactly in binary.
There are very many decimal numbers that can't be represented exactly in a finite number of binary digits. Excel is almost certainly using double precision (64 bits) to represent real numbers. Some decimal numbers, such as 0.2 or 0.1 have terminating decimal representations, but their binary representations don't terminate. As a result they get rounded and truncated to fit in 64 bits.

This is a common problem in virtually all programming languages, particularly Fortran, C, and C++. Some languages, such as C#, have a Decimal class that can maintain higher precision for use by financial applications. Java has a DecimalFormat that can be used to deal with the same kind of problem. I'm not an expert in Java, so it could be that some other class or package can be used.
 
  • Like
Likes BvU and lomidrevo
  • #16
I understand what is a floating point error. I was looking for a fix. Typically, Python auto corrects the display of floating point numbers, unless they are the result of a calculation or are otherwise not displayed exactly. To display a floating point number exactly, you can simply apply the function below, which will display 25.149999999999977 as 25.15 and totally solves my problem.

Python:
fix_float_error = lambda x : float( '{:n}'.format( x ) )
 
  • #17
Zap said:
I understand what is a floating point error. I was looking for a fix. Typically, Python auto corrects the display of floating point numbers, unless they are the result of a calculation or are otherwise not displayed exactly. To display a floating point number exactly, you can simply apply the function below, which will display 25.149999999999977 as 25.15 and totally solves my problem.

Python:
fix_float_error = lambda x : float( '{:n}'.format( x ) )

Isn't that just rounding the numbers like like you suggested doing yourself. It is possible it will erroniously truncate some numbers. And converting the formatted string of the rounded float back to float won't help.

Python:
>>> x = 10.15
>>> print(x)
10.15
>>> format(x,'.60g')
'10.1500000000000003552713678800500929355621337890625'
>>> x = float( '{:n}'.format( x ) )
>>> format(x, '.60g')
'10.1500000000000003552713678800500929355621337890625'
>>> y = 10.00000000000000001
>>> print(y)
10.0
>>> '{:n}'.format( y )
'10'

Do you have access to the correct numbers? If so, why not store them as strings?
 
Last edited:
  • #18
If I simply convert it to a string, or try to parse the data as strings, the pandas Excel parser gives '25.149999999999977' instead of ' 25.15.' Additionally, I want to store these values as double data types in SQL. So, strings will not work. SQL has some way of representing 25.15 exactly, just as Python does, but it will not represent the number as 25.15 if I upload it as 25.149999999999977.

I don't think float( '{:n}'.format( x ) ) is the same as rounding. The documentation is here. From what I understand, float( '{:n}'.format( x ) ) will force Python to represent the floating point number exactly, although it will still be stored as something like 25.149999999999977, because storing the exact value 25.15 is not possible in binary. Python will display the number as 25.15, even though it is really something like 25.149999999999977. SQL also has some way of displaying the exact value of floating point numbers, when the real value is an approximation in binary. That's why you can write x = 25.15 in Python, and it will display as 25.15. The same is true with SQL and Excel.

The problem has something to do with converting Excel data to Python data, and this built-in feature of representing floating point numbers as exact values in both systems is somehow lost in translation.

I don't want to put 25.149999999999977 into a database, because this data will be part of a professional website, and that's not what a user should see. It's an obvious floating point error. No one wants to see that.
 
Last edited:
  • #19
Zap said:
SQL has some way of representing 25.15 exactly, just as Python does,
Yes, but both languages will just round and/or truncate a number to a desired number of places. SQL has a Decimal data type that allows you to specify the precision and scale. In SQL they define precision as the total number of digits displayed, and scale as the number of digits to the right of the decimal point. Some more info: Understanding the SQL Decimal data type (sqlshack.com)
Zap said:
From what I understand, float( '{:n}'.format( x ) ) will force Python to represent the floating point number exactly, although it will still be stored as something like 25.149999999999977, because storing the exact value 25.15 is not possible in binary.
You can do something simpler to force Python to display a floating point number to a desired precision.
Python:
x = 20.15

for i in range(18):
    print(format(x, str( '.'+ str(i) + 'f')))
In the print statement, the 2nd argument of format is a string that I'm constructing. The first few format specifiers will be .0f, .1f, .2f, and so on, up to .17f.
Here's the output of the above:
Code:
C:\Users\Mark\Documents\Python3.4.2>python test2.py
20
20.1
20.15
20.150
20.1500
20.15000
20.150000
20.1500000
20.15000000
20.150000000
20.1500000000
20.15000000000
20.150000000000
20.1500000000000
20.15000000000000
20.149999999999999
20.1499999999999986
20.14999999999999858
Of course, a number such as 20.125 will display the exact result with any number of decimal places, because the bit pattern for 20.125 terminates.
 
  • Like
Likes lomidrevo
  • #20
Zap said:
SQL has some way of representing 25.15 exactly, just as Python does
Python and some SQL databases have the decimal type. This will avoid the floating point errors. A quick look at pandas indicates that you may be able to use boxed decimals but it doesn't support them natively.

Zap said:
I don't think float( '{:n}'.format( x ) ) is the same as rounding... float( '{:n}'.format( x ) ) will force Python to represent the floating point number exactly

What else would it be doing?

The closest value that a double precision IEEE754 floating point number can get to 25.15 is:
25.14999999999999857891452847979962825775146484375

The exact value of this number is 25.14999999999999857891452847979962825775146484375

Any program that uses IEEE754 floating point values internally that tells you the number is 25.15 is rounding. Any claim to the contrary is a lie.

If you tell python to round 25.15 to one decimal place you get 25.1 since you are actually telling it to round 25.14999999999999857891452847979962825775146484375
Python:
>>> print('{:.1f}'.format( 25.15 ))
25.1

BoB

EDIT> Mark44 and Jarvis323 posted while I was typing so most of this is redundant.
 
  • #21
That's displaying a string value, no?

Python does represent some floating point numbers as exact results, even though behind the scenes they are approximations in binary. The same is true for SQL and Excel. Otherwise, you could not assign a floating point value of 25.15 to a cell or variable, but you can.

This documentation alludes to '{:n}'.format as a way of enforcing the false representation of approximated floating point numbers as exact values. From what I can tell, it is doing that. The floating point errors encountered after reading the Excel sheets disappeared after applying float( '{:n}'.format( x ) ), so I'm not in a rush to try another method.

I will double check to see if there were any truncation errors, but it seemed to have worked beautifully.
 
  • #22
Jarvis323 said:
Anyways, I think that rounding the numbers this way is fine. But note that lambda x : float( '{:n}'.format( float( x ) ) ) is pointless. In the case of 25.15, stored as 25.149999999999977, it will just round and convert it to a string, and then convert it back to a float with the same value of 25.149999999999977. It doesn't change anything at all. In some cases, as in the case of y in my example (where the rounded value is less correct than the stored value), it will actually just cause you to lose some precision.

It changes it into a floating representation of 25.15, which is a display of 25.15 for a value of 25.149999999999977.
 
  • #23
rbelli1 said:
Python and some SQL databases have the decimal type. This will avoid the floating point errors. A quick look at pandas indicates that you may be able to use boxed decimals but it doesn't support them natively.
What else would it be doing?

The closest value that a double precision IEEE754 floating point number can get to 25.15 is:
25.14999999999999857891452847979962825775146484375

The exact value of this number is 25.14999999999999857891452847979962825775146484375

Any program that uses IEEE754 floating point values internally that tells you the number is 25.15 is rounding. Any claim to the contrary is a lie.

If you tell python to round 25.15 to one decimal place you get 25.1 since you are actually telling it to round 25.14999999999999857891452847979962825775146484375
Python:
>>> print('{:.1f}'.format( 25.15 ))
25.1

BoB

EDIT> Mark44 and Jarvis323 posted while I was typing so most of this is redundant.

You can't round to 25.15 , because you can't calculate 25.15 in binary. If you see a "rounded" 25.15 floating point number, the program has hid the nasty floating point error and is displaying 25.15 to you instead, even though the true value is something like 25.1499999999999 ...
 
  • #24
Zap said:
You can't round to 25.15 , because you can't calculate 25.15 in binary.

When we round the "ugly" floating point value to the string value "25.15" we certainly can do that exactly in binary.

Python:
>>> print('{:.2f}'.format( 25.14999999999999857891452847979962825775146484375 ))
25.15

double 0x4039266666666666 gets converted to the series of bytes 0x32, 0x35, 0x2E, 0x31, 0x35.

BoB
 
  • #25
You can round to a floating point number of 25.15, but it's not really 25.15, it is displayed as 25.15, and it's not a string. It's a floating point number. Try it for yourself, or read the documentation. You can display the floating point number as 25.15, even though it's something else, for the hundredth time!
 
  • #26
Zap said:
If you see a "rounded" 25.15 floating point number, the program has hid the nasty floating point error and is displaying 25.15 to you instead, even though the true value is something like 25.1499999999999 ...
You do know that 25.1499999999999... is exactly equal to 25.15, right? The ellipsis (...) means that the 9 digits repeat that pattern forever.

As someone already asked, do you really need more than 13 decimal digits of precision in your application? I can't think of any measurements that would be that precise.
 
  • #27
Are you reading this thread? Do you seriously think I meant an infinitely repeating number on a computer? I am not interested in increasing precision. Please read the thread before responding.
 
  • #28
Python:
>>> x = 25.149999999999977
>>> x
25.149999999999977
>>> float( '{:n}'.format( x ) )
25.15
>>> x = float( '{:n}'.format( x ) )
>>> type( x )
<class 'float'>
>>> x
25.15
>>> x = 0.3 - 0.2
>>> x
0.09999999999999998
>>> float( '{:n}'.format( x ) )
0.1
>>> x = float( '{:n}'.format( x ) )
>>> x
0.1
>>> type( x )
<class 'float'>
>>>
 
  • #29
The whole point was representing the float as 25.15. The point I'm trying to make is that the floating point number is being represented exactly as 25.15, even though in the background it is still a binary approximation, and float( '{:n}'.format( x ) ) is coercing Python to represent the binary approximation as an exact value. How many times do I have to say this?
 
  • #30
You can't round to 25.15 ...

The number is being displayed as 25.15, it isn't being rounded. You can't calculate 25.15 in binary. So, how is Python giving me a floating point number of 25.15? Because it's being DISPLAYED AS 25.15 AND NOT ACTUALLY 25.15!
 
  • #31
Zap said:
I noticed that when reading the file, there are floating point errors in the results. For example, a cell that contains 25.15 will be 25.149999999999977 in the pandas data frame.
For double precision floating point; 25.15 = 25.149999999999977;
That is NOT an error. It is a failure to format and round the output.

If you look at the excel formatting information for numbers in the column that shows 25.15, you will probably see it is specified to display as "xxx.xx"; which will show the number rounded to 2 decimal places. Excel is storing the double precision number as 25.149999999999977 internally.

If you want to control the way a double precision number appears in pandas, you will need to read the pandas manual on display formatting, which probably uses the Round(,) format function. Like Excel, pandas is also storing the double precision number as 25.149999999999977 internally.
https://www.xspdf.com/resolution/51898492.html
 
  • Like
Likes lomidrevo, Vanadium 50 and BvU
  • #32
Zap said:
Are you reading this thread?

Are you?

Lots of good stuff here.
 
  • #33
Zap said:
Are you reading this thread? Do you seriously think I meant an infinitely repeating number on a computer? I am not interested in increasing precision. Please read the thread before responding.
I have read the entire thread. No, I don't believe you meant an infinitely repeating number, but that's what you wrote:
Zap said:
even though the true value is something like 25.1499999999999 ...

As I recall, you have a degree in physics, so you should know exactly what the ellipsis means in 25.1499999999999 ...
 
  • #34
Strange.

There is a way to coerce Python to represent numbers with floating point errors as exact values, even though internally they are being stored as a binary approximation, as Baluncore has said.

It is not the same as rounding, because you cannot round to a number that is impossible to calculate in binary.

So for example, if you have a floating point number of 25.149999999999977, you cannot round it to 25.15, because 25.15 cannot be calculated exactly in binary. If you attempt to round it to 25.15, Python will display the value as 25.15, even though internally it is still something like 25.149999999999977. So, it's not really rounding but coercing Python to represent the binary approximation as an exact value.

However, as stated earlier, I don't want to use the round() function, because it's impossible to tell the computer in a simple way at which decimal place to round or which numbers have floating point errors and which do not, and I don't actually want to round numbers.

So, instead, there is a way to coerce Python to display certain numbers that contain floating point errors as exact results while keeping the original value intact, which is not rounding.
 
Last edited:
  • #35
Zap said:
So for example, if you have a floating point number of 25.149999999999977, you cannot round it to 25.15, because 25.15 cannot be calculated exactly in binary.
You're confusing the internal representation of a number (e.g., 25.149999999999977) with its displayed value, which in your example is the displayed value of 25.15.
The value that is displayed is a string of numerical digit characters together with a decimal point.
Zap said:
If you attempt to round it to 25.15, Python will display the value as 25.15, even though internally it is still something like 25.149999999999977. So, it's not really rounding but coercing Python to represent the binary approximation as an exact value.
The number 25.149999999999977 does not exist in memory. Instead, there is the binary representation, most likely in 64 bits. This 64-bit value is rounded or truncated to the desired precision and converted to a string of characters by print() or format(), but the underlying number is not changed.

The Python code I showed a while back shows a sequence of displayed values for the variable x, with varying numbers of decimal digits. At no time is the underlying value of x changed due to rounding or truncation. The only things that change are the (string) representations of this number according to how it's being formatted.
 
<h2>1. Why do floating point errors occur when parsing data in Pandas?</h2><p>Floating point errors occur when parsing data in Pandas because of the way computers store and manipulate numbers. Floating point numbers are stored as binary fractions, which can result in rounding errors when converting between decimal and binary representations.</p><h2>2. How do floating point errors affect data analysis in Pandas?</h2><p>Floating point errors can affect data analysis in Pandas by introducing small inaccuracies in calculations, which can compound and lead to larger errors in results. This can be especially problematic in financial or scientific data analysis where precision is crucial.</p><h2>3. Can floating point errors be avoided in Pandas?</h2><p>While it is not possible to completely avoid floating point errors in Pandas, there are steps that can be taken to minimize their impact. These include using appropriate data types, rounding or truncating data, and being aware of potential sources of error in calculations.</p><h2>4. How can I identify and troubleshoot floating point errors in Pandas?</h2><p>To identify and troubleshoot floating point errors in Pandas, it is important to carefully examine the data and the calculations being performed. This may involve checking for unexpected values or using alternative methods for performing calculations.</p><h2>5. Are floating point errors specific to Pandas or do they occur in other data analysis tools?</h2><p>Floating point errors can occur in any data analysis tool that uses floating point numbers, not just Pandas. This includes other programming languages, spreadsheet programs, and statistical software. It is important for data analysts to be aware of these potential errors and how to handle them in their specific tools.</p>

1. Why do floating point errors occur when parsing data in Pandas?

Floating point errors occur when parsing data in Pandas because of the way computers store and manipulate numbers. Floating point numbers are stored as binary fractions, which can result in rounding errors when converting between decimal and binary representations.

2. How do floating point errors affect data analysis in Pandas?

Floating point errors can affect data analysis in Pandas by introducing small inaccuracies in calculations, which can compound and lead to larger errors in results. This can be especially problematic in financial or scientific data analysis where precision is crucial.

3. Can floating point errors be avoided in Pandas?

While it is not possible to completely avoid floating point errors in Pandas, there are steps that can be taken to minimize their impact. These include using appropriate data types, rounding or truncating data, and being aware of potential sources of error in calculations.

4. How can I identify and troubleshoot floating point errors in Pandas?

To identify and troubleshoot floating point errors in Pandas, it is important to carefully examine the data and the calculations being performed. This may involve checking for unexpected values or using alternative methods for performing calculations.

5. Are floating point errors specific to Pandas or do they occur in other data analysis tools?

Floating point errors can occur in any data analysis tool that uses floating point numbers, not just Pandas. This includes other programming languages, spreadsheet programs, and statistical software. It is important for data analysts to be aware of these potential errors and how to handle them in their specific tools.

Similar threads

  • Programming and Computer Science
Replies
8
Views
5K
Back
Top