Understand Binning Data in Python

EngWiPy · Dec 6, 2017

Hello,

I was reading an example on binning data, where a continuous variable is transformed into a categorical variable. The dataframe is named df, and the continuous variable is in the column 'horsepower'. We would like to transform this continuous feature into a categorical feature with three values (low, medium, and high) and store the result in a new column called 'horsepower_cat'. The code given for this is:

Python:

binwidth = (df['horsepower'].max() - df['horsepower'].min())/4 #why 4 not 3??

bins = np.arange(df['horsepower'].min(), df['horsepower'].max(), binwidth)  # np is from import numpy as np

group_names = ['low', 'medium', 'high']

df['horsepower_cat'] = pd.cut(df['horsepower'], bins, labels = group_names, include_lowest = True) #pd is from import pandas as pd

The values are:

Python:

df['horsepower'].min():  48.0
df['horsepower'].max(): 262.0

binwidth: 53.5

bins: array([48.0,  101.5,  155. ,  208.5])

The variable bins does not cover the whole range: we get three bins that only go up to 208.5! Why? And why did we divide the range by 4 in the first line, although we want 3 equal bins? The example says the following:

"We would like four bins of equal size bandwidth, the fourth is because the function 'cut' includes the rightmost edge."

What does this mean?

Could anyone help me understand this?

Mark44 · Dec 6, 2017

The bin separators include the minimum and maximum, plus two more equally spaced separators -- four numbers. These separators give you three bins: 48.0 to 101.5, 101.5 to 155, 155 to 208.5.

This is the classic fencepost problem. If you put 10 fenceposts along a property line, there will be 9 sections defined by the fenceposts.
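As a minimal sketch of the fencepost counting, using the four edge values quoted in the thread (the variable names here are only illustrative):

```python
# Bin edges quoted in the thread (the "fenceposts").
edges = [48.0, 101.5, 155.0, 208.5]

# Pair each edge with the next one to list the intervals between them.
intervals = list(zip(edges[:-1], edges[1:]))

print(len(edges), "edges ->", len(intervals), "intervals")
for left, right in intervals:
    print(f"{left} to {right}")
```

Four edges always give three intervals, whatever the edge values are.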

EngWiPy · Dec 7, 2017

Yes, right, but these three bins don't reach the maximum value. Suppose I have a set of numbers in the range [X_min, X_max]. Wouldn't we divide the range by 3, as in (X_max - X_min)/3, to get three equally spaced bins? For example, if X_min = 1 and X_max = 10, then binwidth = (10 - 1)/3 = 3, so we will have one bin from 1 to 4, one from 4 to 7, and one from 7 to 10: three bins of equal width. In the example I first posted, the maximum value is not included in the bins' boundaries, and the range was divided by 4, not 3. Why? It is said that this has to do with the cut method of the pandas library, but I am not sure how!
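The arithmetic in this example can be checked with a few lines of plain Python. This is only a sketch of the reasoning in the post; the names x_min, x_max, and n_bins are chosen here for illustration:

```python
x_min, x_max = 1, 10
n_bins = 3

# Width of each bin when the range is split into n_bins equal parts.
binwidth = (x_max - x_min) / n_bins

# n_bins bins need n_bins + 1 edges, from x_min up to and including x_max.
edges = [x_min + i * binwidth for i in range(n_bins + 1)]

print(binwidth)
print(edges)
```

Dividing by the number of bins gives the width, while the number of edges is one more than the number of bins.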

willem2 · Dec 7, 2017

You are right. If I type this in Python:

Code:

import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# Dividing by 4 and using arange gives 4 edges, but the last one is 208.5, not the maximum
binwidth = (horsepower.max() - horsepower.min()) / 4

bins = np.arange(horsepower.min(), horsepower.max(), binwidth)

print(pd.cut(horsepower, bins, labels=group_names, include_lowest=True))

you get [low, medium, NaN, medium, NaN, high]. Anything over 208.5 falls outside the range and produces NaN.
If, however, you divide by 3, you get: ValueError: Bin labels must be one fewer than the number of bin edges
cut() needs an array of 4 numbers to define the borders of 3 intervals.
The problem is in the arange function, which does not include the maximum of the range. According to the documentation at scipy.org:

When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases.

Using linspace, where you choose how many points you need, is easier.

Code:

import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# 4 evenly spaced edges from min to max (both included) define 3 bins
bins = np.linspace(horsepower.min(), horsepower.max(), 4)

print(pd.cut(horsepower, bins, labels=group_names, include_lowest=True))
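To see the difference side by side, here is a small sketch that builds both sets of edges for the same six sample values and labels them with cut. The names arange_edges, linspace_edges, by_arange, and by_linspace are just illustrative:

```python
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']
horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# arange stops before the maximum, so 262.0 is never an edge.
binwidth = (horsepower.max() - horsepower.min()) / 4
arange_edges = np.arange(horsepower.min(), horsepower.max(), binwidth)

# linspace includes both endpoints; 4 edges define 3 equal-width bins.
linspace_edges = np.linspace(horsepower.min(), horsepower.max(), 4)

by_arange = pd.cut(horsepower, arange_edges, labels=group_names, include_lowest=True)
by_linspace = pd.cut(horsepower, linspace_edges, labels=group_names, include_lowest=True)

print(arange_edges)
print(linspace_edges)
print(list(by_arange))    # values above the last edge come out as NaN
print(list(by_linspace))  # every value gets a label
```

With the arange edges, the two values above 208.5 are left unlabeled; with the linspace edges, all six values fall into one of the three bins.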

EngWiPy · Dec 7, 2017

Thanks. Now I understand what the example meant by "..., the fourth is because the function 'cut' includes the rightmost edge", but it seems the example didn't handle it correctly. What you suggested works perfectly and makes perfect sense. I counted each category using the first method as follows:

Python:

df['horsepower_cat'].value_counts()

and the output was:

Python:

low       115
medium     62
high       23

Comparing these numbers with the histogram figure shows that they aren't correct.

With the linspace method, the numbers are:

Python:

low       153
medium     43
high        5

which are in alignment with the histogram.

I think the example I am reading contains some mistakes. After trying to bin the data as above, the author tries to draw the histogram as follows:

Python:

%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(df['horsepower'], bins = 3)

However, df['horsepower'] is a pandas Series rather than a plain array (it carries an index), and its data type in the original data is object. So the above code gave an error, and I had to modify it as follows:

Python:

# what does the next line do, by the way?
%matplotlib inline
import matplotlib.pyplot as plt

horsepower = df['horsepower'].tolist()  # convert the column (a Series) to a list
plt.hist([float(t) for t in horsepower], bins=3)  # cast each value to float before plotting

for it to work.

Thanks again.
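A possible alternative to the list comprehension, sketched under the assumption that the column has object dtype because the numbers are stored as strings: pd.to_numeric can convert the whole column at once. The small dataframe below is a hypothetical stand-in for the real data, and the '?' entry is an assumed placeholder for a missing value. np.histogram is used here only to check the counts for three equal-width bins without needing a plot.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the thread's dataframe: numbers stored as strings,
# plus one '?' placeholder (an assumption about how missing values might appear).
df = pd.DataFrame({'horsepower': ['48.0', '117.0', '?', '262.0', '139.0', '223.0', '173.0']})

# Convert to floats; anything that cannot be parsed becomes NaN and is dropped.
values = pd.to_numeric(df['horsepower'], errors='coerce').dropna()

# Count the values in 3 equal-width bins spanning the minimum to the maximum.
counts, edges = np.histogram(values, bins=3)

print(values.dtype)
print(counts)
print(edges)
```

Once the column is numeric, passing values to plt.hist with bins=3 should work without converting to a list first.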

Mark44 · Dec 7, 2017

S_David said:

Yes, right, but these three bins don't reach to the maximum value.

My mistake. I was thinking that 208.5 was the max value, even though your code showed a larger value.
