Understand Binning Data in Python

  • Thread starter: EngWiPy
  • Tags: Data, Python
AI Thread Summary
Binning data in Python means transforming a continuous variable, such as 'horsepower', into categorical values like low, medium, and high. The confusion arises from needing four bin edges to create three categories: `pd.cut()` requires one more bin edge than the number of labels. The code in question divided the range by 4 and built the edges with `np.arange`, which excludes the stop value, so the last edge was 208.5 instead of the maximum of 262 and values above it became NaN. Using `np.linspace` with four points includes both endpoints, so all data points fall inside a bin. The example discussed highlights the importance of defining bin boundaries correctly so that the category counts align with the histogram.
EngWiPy
Hello,

I was reading an example on binning data, where a continuous variable is transformed into a categorical variable. The dataframe is named df, and the continuous variable is in the column 'horsepower'. We would like to transform this continuous feature into a categorical feature with three values (low, medium, and high) and put the result in a new feature called 'horsepower_cat'. The code given for this is:

Python:
binwidth = (df['horsepower'].max() - df['horsepower'].min())/4 #why 4 not 3??

bins = np.arange(df['horsepower'].min(), df['horsepower'].max(), binwidth) #np is from import numpy as np

group_names = ['low', 'medium', 'high']

df['horsepower_cat'] = pd.cut(df['horsepower'], bins, labels = group_names, include_lowest = True) #pd is from import pandas as pd

The values are:

Python:
df['horsepower'].min():  48.0
df['horsepower'].max(): 262.0

binwidth: 53.5

bins: array([48.0,  101.5,  155. ,  208.5])

The variable bins does not cover the whole range: we get three bins that only go up to 208.5! Why? Why did we divide the range by 4 in the first line, although we want 3 equal bins? The example says the following:
We would like four bins of equal size bandwidth,the forth is because the function "cut" include the rightmost edge.
What does this mean?

Could anyone help me understand this?
 
S_David said:
Why did we divide the range by 4 in the first line, although we want 3 equal bins?
The bin separators include the minimum and maximum, plus two more equally spaced separators -- four numbers. These separators give you three bins: 48.0 to 101.5, 101.5 to 155, 155 to 208.5.

This is the classic fencepost problem. If you put 10 fenceposts along a property line, there will be 9 sections defined by the fenceposts.
 
Mark44 said:
These separators give you three bins: 48.0 to 101.5, 101.5 to 155, 155 to 208.5.

Yes, right, but these three bins don't reach the maximum value. Suppose I have a set of numbers in the range [X_min, X_max]. Wouldn't we divide the range by 3, as in (X_max - X_min)/3, to get three equally spaced bins? For example, if X_min = 1 and X_max = 10, then binwidth = (10 - 1)/3 = 3. So we will have one bin from 1 to 4, one from 4 to 7, and one from 7 to 10, which are three bins of equal width. In the example I first posted, the maximum value is not included in the bins' boundaries, and the range was divided by 4, not 3. Why? It is said that it has to do with the cut method of the pandas library, but I am not sure how!
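
To make the 1-to-10 example concrete, here is a minimal sketch (the names x_min, x_max, n_bins and edges are only illustrative, not from the example being read). It shows that the bin width comes from dividing by 3, while the number of boundary values needed is 4:

```python
import numpy as np

x_min, x_max = 1.0, 10.0
n_bins = 3

# The width of each bin comes from dividing the range by the number of bins...
binwidth = (x_max - x_min) / n_bins

# ...but describing n_bins intervals takes n_bins + 1 boundary values.
edges = np.linspace(x_min, x_max, n_bins + 1)

print(binwidth)    # width of one bin
print(edges)       # boundaries at 1, 4, 7 and 10
print(len(edges))  # one more edge than bins
```

So dividing by 3 is right for the width, and the "4" only belongs to the count of edges that cut() expects.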
 
You are right, if I type this in Python:

Code:
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# Dividing by 4 yields 4 edges from arange, but the last edge is 208.5, not the maximum
binwidth = (horsepower.max() - horsepower.min())/4

bins = np.arange(horsepower.min(), horsepower.max(), binwidth)

print(pd.cut(horsepower, bins, labels=group_names, include_lowest=True))
you get [low, medium, NaN, medium, NaN, high]. Anything over 208.5 falls outside the range and produces NaN.
If you instead divide by 3, you get: ValueError: Bin labels must be one fewer than the number of bin edges
cut() needs an array of 4 numbers to define the borders of 3 intervals.
The problem is the arange function, which does not include the maximum of the range. According to the documentation at scipy.org:
When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases.

Using linspace, where you choose how many points you need, is easier.

Code:
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# 4 equally spaced edges, including both the minimum and the maximum
bins = np.linspace(horsepower.min(), horsepower.max(), 4)

print(pd.cut(horsepower, bins, labels=group_names, include_lowest=True))

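For a side-by-side view, the sketch below builds the edges both ways on the same six sample values and counts how many values end up as NaN. The last part is an addition that is not from the example under discussion: pd.cut also accepts an integer number of bins and then builds equal-width edges itself. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']
horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])
lo, hi = horsepower.min(), horsepower.max()

# arange stops before `hi`, so values above the last edge are left without a bin.
arange_edges = np.arange(lo, hi, (hi - lo) / 4)
arange_cats = pd.cut(horsepower, arange_edges, labels=group_names, include_lowest=True)

# linspace includes both endpoints, so 4 points give 3 bins over the full range.
linspace_edges = np.linspace(lo, hi, 4)
linspace_cats = pd.cut(horsepower, linspace_edges, labels=group_names, include_lowest=True)

# Passing an integer asks pd.cut to build the equal-width bins itself.
auto_cats, auto_edges = pd.cut(horsepower, 3, labels=group_names, retbins=True)

print(arange_edges, pd.isna(arange_cats).sum())
print(linspace_edges, pd.isna(linspace_cats).sum())
print(auto_edges, pd.isna(auto_cats).sum())
```

Only the arange version leaves values unassigned; the other two cover the maximum.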
 
willem2 said:
Using linspace where you can choose yourself how many points you need is easier.

Thanks. Now I know what was meant by

"..., the forth is because the function "cut" include the rightmost edge."

but it seems the example didn't handle it correctly. Your suggestion works perfectly and makes perfect sense. I counted each category using the first method as follows:

Python:
df['horsepower_cat'].value_counts()

and the output was:

Python:
low       115
medium     62
high       23

Comparing these numbers with the histogram figure shows that they aren't correct.

testplot.png

With the linspace method, the numbers are:

Python:
low       153
medium     43
high        5

which are in alignment with the histogram.
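
The agreement with the histogram can also be checked numerically. This is a rough sketch on six made-up sample values (not the real df['horsepower'] column), reusing the edges that np.histogram computes for bins=3 as the edges for pd.cut:

```python
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']
# Hypothetical sample values; the real df['horsepower'] column is not reproduced here.
horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# With bins=3, np.histogram uses 4 equally spaced edges from the minimum to the maximum.
hist_counts, hist_edges = np.histogram(horsepower, bins=3)

# Reuse exactly those edges for pd.cut so both tools use the same boundaries.
cats = pd.cut(horsepower, hist_edges, labels=group_names, include_lowest=True)
cut_counts = [int((cats == name).sum()) for name in group_names]

print(hist_counts.tolist())
print(cut_counts)
```

One caveat: np.histogram treats bins as closed on the left (except the last one), while pd.cut by default closes them on the right, so a value lying exactly on an interior edge could be counted in different bins by the two.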

I think the example I am reading contains some mistakes. After trying to bin the data as above, the author tries to draw the histogram as follows:

Python:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(df['horsepower'], bins = 3)

However, df['horsepower'] is not a one-dimensional array, since it includes the indices, and its data type in the original data is object. So the above code gave an error, and I had to modify it to:

Python:
# what does %matplotlib inline do, by the way?
%matplotlib inline
import matplotlib.pyplot as plt

horsepower = df['horsepower'].tolist()  # convert the column (a Series) to a list
plt.hist([float(t) for t in horsepower], bins = 3)

for it to work.

Thanks again.
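
A possible alternative to the list comprehension, assuming the error came from the object dtype (a Series is itself one-dimensional, so the index is probably not the cause): convert the column with pd.to_numeric. The small dataframe below is a hypothetical stand-in, including a made-up '?' entry to show what happens to text that cannot be parsed.

```python
import pandas as pd

# Hypothetical stand-in for a column that was read in as text (dtype object).
df = pd.DataFrame({'horsepower': pd.Series(['48', '117', '?', '262'], dtype=object)})
print(df['horsepower'].dtype)

# Convert to numbers; errors='coerce' turns unparseable entries into NaN,
# and dropna() then removes them.
hp = pd.to_numeric(df['horsepower'], errors='coerce').dropna()
print(hp.dtype)
print(hp.tolist())

# hp is now a numeric Series and can be passed to plt.hist(hp, bins=3) directly.
```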
 

S_David said:
Yes, right, but these three bins don't reach to the maximum value.
My mistake. I was thinking that 208.5 was the max value, even though your code showed a larger value.
 