Why Are There Floating Point Errors When Parsing Data in Pandas?

Zap · Feb 11, 2021

I am parsing data from an excel sheet using pandas ExcelFile class and read_excel function. I noticed that when reading the file, there are floating point errors in the results. For example, a cell that contains 25.15 will be 25.149999999999977 in the pandas data frame.

The value 25.15 in Excel is not rounded. It is the result of adding 298.15 minus 273.

This link describes the problem, but there is not solution : https://github.com/pandas-dev/pandas/issues/21885

I'm surprised ExcelFile and read_excel has an error like this. It pretty much makes them unusable, in my opinion.

Vanadium 50 · Feb 11, 2021

Do you need 13 digits of precision? And if you do, why are you using floats?

BvU · Feb 11, 2021

Zap said:

It pretty much makes them unusable

Is that so ?
What accuracy is required by your application ?

BvU · Feb 11, 2021

By the way, doesn't the fact that 1 + 1e-15 - 1 =0 in Excel make the whole spreadsheet program Excel unusable in the first place ?

a	1
b	1.00E-15
c = a+ b	1.000000000000000000
c-1	0.000000000000000000

Zap · Feb 11, 2021

I need valid data. It's not about accuracy. I don't want to upload data that is clearly an error like that. It's a data quality issue, not a precision issue.

I've tried parsing them as strings, but the same issue arises. It's a string '25.149999999999977 ' instead of the expected '25.15'.

I'm wondering if there is a way to correct this floating point error without doing something like round(), because it will not be easy to tell the computer at which decimal place to round.

Zap · Feb 11, 2021

BvU said:

By the way, doesn't the fact that 1 + 1e-15 - 1 =0 in Excel make the whole spreadsheet program Excel unusable in the first place ?

a
1
b
1.00E-15
c = a+ b
1.000000000000000000
c-1

0.000000000000000000

Context is key here. I'm not doing calculations. I'm uploading cited data that needs to have a certain level of integrity.

BvU · Feb 11, 2021

Excel has about 16 digits of accuracy. What kind of data has its integrity damaged by that ? and, again:

BvU said:

What accuracy is required by your application ?

Zap said:

Context is key here.

I'm really curious about the context !

Because there are solutions for many applications...

Zap · Feb 11, 2021

It's a floating point error. I don't want 16 digits.

BvU · Feb 11, 2021

Yes No Yes No

You still haven't explained why

Zap · Feb 11, 2021

I don't want to upload data into a data base that is clearly an error such as a floating point error, because data quality is of upmost importance for this database, and it will be available publicly. Would you want to access a website and see that floating point error as a cited experimental measurement?

Zap · Feb 11, 2021

I am asking if you know of a way to correct for this error. If not, then thank you for your response.

BvU · Feb 11, 2021

Zap said:

Would you want to access a website and see that floating point error as a cited experimental measurement?

A decent website doesn't present all digits. For empirical data 16 digits is more than enough

Zap said:

I am asking if you know of a way to correct for this error.

It's not an error.

##\ ##

Zap · Feb 11, 2021

It's called a floating point error because you can't represent 25.15 exactly in binary.

Zap · Feb 11, 2021

Okay. So, in Python, '{:n}'.format( x ), where x = 25.149999999999977 evaluates to '25.15,' which solves my problem.

Interestingly, if you do float( '{:n}'.format( x ) ), the result is 25.15, even though x = 25.149999999999977. Python has some built-in way of outputting floats exactly, even though behind the scenes the displayed value 25.15 may be stored as something like 25.1499999999999 ... I'm assuming SQL has something similar, but if I upload the data as 25.1499999999999, I will not get the clean display of 25.15. This was the original issue, and I have fixed it myself.

In a Pandas dataframe, you can apply the function lambda x : float( '{:n}'.format( float( x ) ) ) to remove floating point error.

Thanks,
Zap.

Mark44 · Feb 11, 2021

Zap said:

It's called a floating point error because you can't represent 25.15 exactly in binary.

There are very many decimal numbers that can't be represented exactly in a finite number of binary digits. Excel is almost certainly using double precision (64 bits) to represent real numbers. Some decimal numbers, such as 0.2 or 0.1 have terminating decimal representations, but their binary representations don't terminate. As a result they get rounded and truncated to fit in 64 bits.

This is a common problem in virtually all programming languages, particularly Fortran, C, and C++. Some languages, such as C#, have a Decimal class that can maintain higher precision for use by financial applications. Java has a DecimalFormat that can be used to deal with the same kind of problem. I'm not an expert in Java, so it could be that some other class or package can be used.

Zap · Feb 11, 2021

I understand what is a floating point error. I was looking for a fix. Typically, Python auto corrects the display of floating point numbers, unless they are the result of a calculation or are otherwise not displayed exactly. To display a floating point number exactly, you can simply apply the function below, which will display 25.149999999999977 as 25.15 and totally solves my problem.

Python:

fix_float_error = lambda x : float( '{:n}'.format( x ) )

Jarvis323 · Feb 11, 2021

Zap said:
I understand what is a floating point error. I was looking for a fix. Typically, Python auto corrects the display of floating point numbers, unless they are the result of a calculation or are otherwise not displayed exactly. To display a floating point number exactly, you can simply apply the function below, which will display 25.149999999999977 as 25.15 and totally solves my problem.
Python:
fix_float_error = lambda x : float( '{:n}'.format( x ) )

Isn't that just rounding the numbers like like you suggested doing yourself. It is possible it will erroniously truncate some numbers. And converting the formatted string of the rounded float back to float won't help.

Python:

>>> x = 10.15
>>> print(x)
10.15
>>> format(x,'.60g')
'10.1500000000000003552713678800500929355621337890625'
>>> x = float( '{:n}'.format( x ) )
>>> format(x, '.60g')
'10.1500000000000003552713678800500929355621337890625'
>>> y = 10.00000000000000001
>>> print(y)
10.0
>>> '{:n}'.format( y )
'10'

Do you have access to the correct numbers? If so, why not store them as strings?

Zap · Feb 11, 2021

If I simply convert it to a string, or try to parse the data as strings, the pandas Excel parser gives '25.149999999999977' instead of ' 25.15.' Additionally, I want to store these values as double data types in SQL. So, strings will not work. SQL has some way of representing 25.15 exactly, just as Python does, but it will not represent the number as 25.15 if I upload it as 25.149999999999977.

I don't think float( '{:n}'.format( x ) ) is the same as rounding. The documentation is here. From what I understand, float( '{:n}'.format( x ) ) will force Python to represent the floating point number exactly, although it will still be stored as something like 25.149999999999977, because storing the exact value 25.15 is not possible in binary. Python will display the number as 25.15, even though it is really something like 25.149999999999977. SQL also has some way of displaying the exact value of floating point numbers, when the real value is an approximation in binary. That's why you can write x = 25.15 in Python, and it will display as 25.15. The same is true with SQL and Excel.

The problem has something to do with converting Excel data to Python data, and this built-in feature of representing floating point numbers as exact values in both systems is somehow lost in translation.

I don't want to put 25.149999999999977 into a database, because this data will be part of a professional website, and that's not what a user should see. It's an obvious floating point error. No one wants to see that.

Mark44 · Feb 11, 2021

Zap said:

SQL has some way of representing 25.15 exactly, just as Python does,

Yes, but both languages will just round and/or truncate a number to a desired number of places. SQL has a Decimal data type that allows you to specify the precision and scale. In SQL they define precision as the total number of digits displayed, and scale as the number of digits to the right of the decimal point. Some more info: Understanding the SQL Decimal data type (sqlshack.com)

Zap said:

From what I understand, float( '{:n}'.format( x ) ) will force Python to represent the floating point number exactly, although it will still be stored as something like 25.149999999999977, because storing the exact value 25.15 is not possible in binary.

You can do something simpler to force Python to display a floating point number to a desired precision.

Python:

x = 20.15

for i in range(18):
    print(format(x, str( '.'+ str(i) + 'f')))

In the print statement, the 2nd argument of format is a string that I'm constructing. The first few format specifiers will be .0f, .1f, .2f, and so on, up to .17f.
Here's the output of the above:

Code:

C:\Users\Mark\Documents\Python3.4.2>python test2.py
20
20.1
20.15
20.150
20.1500
20.15000
20.150000
20.1500000
20.15000000
20.150000000
20.1500000000
20.15000000000
20.150000000000
20.1500000000000
20.15000000000000
20.149999999999999
20.1499999999999986
20.14999999999999858

Of course, a number such as 20.125 will display the exact result with any number of decimal places, because the bit pattern for 20.125 terminates.

rbelli1 · Feb 11, 2021

Zap said:

SQL has some way of representing 25.15 exactly, just as Python does

Python and some SQL databases have the decimal type. This will avoid the floating point errors. A quick look at pandas indicates that you may be able to use boxed decimals but it doesn't support them natively.

Zap said:

I don't think float( '{:n}'.format( x ) ) is the same as rounding... float( '{:n}'.format( x ) ) will force Python to represent the floating point number exactly

What else would it be doing?

The closest value that a double precision IEEE754 floating point number can get to 25.15 is:
25.14999999999999857891452847979962825775146484375

The exact value of this number is 25.14999999999999857891452847979962825775146484375

Any program that uses IEEE754 floating point values internally that tells you the number is 25.15 is rounding. Any claim to the contrary is a lie.

If you tell python to round 25.15 to one decimal place you get 25.1 since you are actually telling it to round 25.14999999999999857891452847979962825775146484375

Python:

>>> print('{:.1f}'.format( 25.15 ))
25.1

BoB

EDIT> Mark44 and Jarvis323 posted while I was typing so most of this is redundant.

Zap · Feb 11, 2021

That's displaying a string value, no?

Python does represent some floating point numbers as exact results, even though behind the scenes they are approximations in binary. The same is true for SQL and Excel. Otherwise, you could not assign a floating point value of 25.15 to a cell or variable, but you can.

This documentation alludes to '{:n}'.format as a way of enforcing the false representation of approximated floating point numbers as exact values. From what I can tell, it is doing that. The floating point errors encountered after reading the Excel sheets disappeared after applying float( '{:n}'.format( x ) ), so I'm not in a rush to try another method.

I will double check to see if there were any truncation errors, but it seemed to have worked beautifully.

Zap · Feb 11, 2021

Jarvis323 said:

Anyways, I think that rounding the numbers this way is fine. But note that lambda x : float( '{:n}'.format( float( x ) ) ) is pointless. In the case of 25.15, stored as 25.149999999999977, it will just round and convert it to a string, and then convert it back to a float with the same value of 25.149999999999977. It doesn't change anything at all. In some cases, as in the case of y in my example (where the rounded value is less correct than the stored value), it will actually just cause you to lose some precision.

It changes it into a floating representation of 25.15, which is a display of 25.15 for a value of 25.149999999999977.

Zap · Feb 11, 2021

rbelli1 said:
Python and some SQL databases have the decimal type. This will avoid the floating point errors. A quick look at pandas indicates that you may be able to use boxed decimals but it doesn't support them natively.
What else would it be doing?

The closest value that a double precision IEEE754 floating point number can get to 25.15 is:
25.14999999999999857891452847979962825775146484375

The exact value of this number is 25.14999999999999857891452847979962825775146484375

Any program that uses IEEE754 floating point values internally that tells you the number is 25.15 is rounding. Any claim to the contrary is a lie.

If you tell python to round 25.15 to one decimal place you get 25.1 since you are actually telling it to round 25.14999999999999857891452847979962825775146484375
Python:
>>> print('{:.1f}'.format( 25.15 ))
25.1
BoB

EDIT> Mark44 and Jarvis323 posted while I was typing so most of this is redundant.

You can't round to 25.15 , because you can't calculate 25.15 in binary. If you see a "rounded" 25.15 floating point number, the program has hid the nasty floating point error and is displaying 25.15 to you instead, even though the true value is something like 25.1499999999999 ...

rbelli1 · Feb 11, 2021

Zap said:

You can't round to 25.15 , because you can't calculate 25.15 in binary.

When we round the "ugly" floating point value to the string value "25.15" we certainly can do that exactly in binary.

Python:

>>> print('{:.2f}'.format( 25.14999999999999857891452847979962825775146484375 ))
25.15

double 0x4039266666666666 gets converted to the series of bytes 0x32, 0x35, 0x2E, 0x31, 0x35.

BoB

Zap · Feb 12, 2021

You can round to a floating point number of 25.15, but it's not really 25.15, it is displayed as 25.15, and it's not a string. It's a floating point number. Try it for yourself, or read the documentation. You can display the floating point number as 25.15, even though it's something else, for the hundredth time!

Mark44 · Feb 12, 2021

Zap said:

If you see a "rounded" 25.15 floating point number, the program has hid the nasty floating point error and is displaying 25.15 to you instead, even though the true value is something like 25.1499999999999 ...

You do know that 25.1499999999999... is exactly equal to 25.15, right? The ellipsis (...) means that the 9 digits repeat that pattern forever.

As someone already asked, do you really need more than 13 decimal digits of precision in your application? I can't think of any measurements that would be that precise.

Zap · Feb 12, 2021

Are you reading this thread? Do you seriously think I meant an infinitely repeating number on a computer? I am not interested in increasing precision. Please read the thread before responding.

Zap · Feb 12, 2021

Python:

>>> x = 25.149999999999977
>>> x
25.149999999999977
>>> float( '{:n}'.format( x ) )
25.15
>>> x = float( '{:n}'.format( x ) )
>>> type( x )
<class 'float'>
>>> x
25.15
>>> x = 0.3 - 0.2
>>> x
0.09999999999999998
>>> float( '{:n}'.format( x ) )
0.1
>>> x = float( '{:n}'.format( x ) )
>>> x
0.1
>>> type( x )
<class 'float'>
>>>

Zap · Feb 12, 2021

The whole point was representing the float as 25.15. The point I'm trying to make is that the floating point number is being represented exactly as 25.15, even though in the background it is still a binary approximation, and float( '{:n}'.format( x ) ) is coercing Python to represent the binary approximation as an exact value. How many times do I have to say this?

Zap · Feb 12, 2021

You can't round to 25.15 ...

The number is being displayed as 25.15, it isn't being rounded. You can't calculate 25.15 in binary. So, how is Python giving me a floating point number of 25.15? Because it's being DISPLAYED AS 25.15 AND NOT ACTUALLY 25.15!

Baluncore · Feb 12, 2021

Zap said:

I noticed that when reading the file, there are floating point errors in the results. For example, a cell that contains 25.15 will be 25.149999999999977 in the pandas data frame.

For double precision floating point; 25.15 = 25.149999999999977;
That is NOT an error. It is a failure to format and round the output.

If you look at the excel formatting information for numbers in the column that shows 25.15, you will probably see it is specified to display as "xxx.xx"; which will show the number rounded to 2 decimal places. Excel is storing the double precision number as 25.149999999999977 internally.

If you want to control the way a double precision number appears in pandas, you will need to read the pandas manual on display formatting, which probably uses the Round(,) format function. Like Excel, pandas is also storing the double precision number as 25.149999999999977 internally.
https://www.xspdf.com/resolution/51898492.html

Vanadium 50 · Feb 12, 2021

Zap said:

Are you reading this thread?

Are you?

Lots of good stuff here.

Mark44 · Feb 12, 2021

Zap said:

Are you reading this thread? Do you seriously think I meant an infinitely repeating number on a computer? I am not interested in increasing precision. Please read the thread before responding.

I have read the entire thread. No, I don't believe you meant an infinitely repeating number, but that's what you wrote:

Zap said:

even though the true value is something like 25.1499999999999 ...

As I recall, you have a degree in physics, so you should know exactly what the ellipsis means in 25.1499999999999 ...

Zap · Feb 12, 2021

Strange.

There is a way to coerce Python to represent numbers with floating point errors as exact values, even though internally they are being stored as a binary approximation, as Baluncore has said.

It is not the same as rounding, because you cannot round to a number that is impossible to calculate in binary.

So for example, if you have a floating point number of 25.149999999999977, you cannot round it to 25.15, because 25.15 cannot be calculated exactly in binary. If you attempt to round it to 25.15, Python will display the value as 25.15, even though internally it is still something like 25.149999999999977. So, it's not really rounding but coercing Python to represent the binary approximation as an exact value.

However, as stated earlier, I don't want to use the round() function, because it's impossible to tell the computer in a simple way at which decimal place to round or which numbers have floating point errors and which do not, and I don't actually want to round numbers.

So, instead, there is a way to coerce Python to display certain numbers that contain floating point errors as exact results while keeping the original value intact, which is not rounding.

Mark44 · Feb 12, 2021

Zap said:

So for example, if you have a floating point number of 25.149999999999977, you cannot round it to 25.15, because 25.15 cannot be calculated exactly in binary.

You're confusing the internal representation of a number (e.g., 25.149999999999977) with its displayed value, which in your example is the displayed value of 25.15.
The value that is displayed is a string of numerical digit characters together with a decimal point.

Zap said:

If you attempt to round it to 25.15, Python will display the value as 25.15, even though internally it is still something like 25.149999999999977. So, it's not really rounding but coercing Python to represent the binary approximation as an exact value.

The number 25.149999999999977 does not exist in memory. Instead, there is the binary representation, most likely in 64 bits. This 64-bit value is rounded or truncated to the desired precision and converted to a string of characters by print() or format(), but the underlying number is not changed.

The Python code I showed a while back shows a sequence of displayed values for the variable x, with varying numbers of decimal digits. At no time is the underlying value of x changed due to rounding or truncation. The only things that change are the (string) representations of this number according to how it's being formatted.

Why Are There Floating Point Errors When Parsing Data in Pandas?

Similar threads

Hot Threads

Recent Insights