Understand Binning Data in Python

In summary: With bin edges built by np.arange, the result is [low, medium, NaN, medium, NaN, high]. Anything over 208.5 falls outside the range and produces NaN.
  • #1
EngWiPy
Hello,

I was reading an example on binning data, where a continuous variable is transformed into a categorical variable. The dataframe is named df, and the continuous variable's column is named 'horsepower'. We would like to transform this continuous feature into a categorical feature with three values (low, medium, and high) and put the result in a new feature called 'horsepower_cat'. The lines of code that do this are:

Python:
import numpy as np
import pandas as pd

binwidth = (df['horsepower'].max() - df['horsepower'].min()) / 4  # why 4 and not 3??

bins = np.arange(df['horsepower'].min(), df['horsepower'].max(), binwidth)

group_names = ['low', 'medium', 'high']

df['horsepower_cat'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)

The values are:

Python:
df['horsepower'].min():  48.0
df['horsepower'].max(): 262.0

binwidth: 53.5

bins: array([48.0,  101.5,  155. ,  208.5])

The variable bins does not cover the whole range: there are three bins, and they only go up to 208.5! Why? And why did we divide the range by 4 in the first line, although we want 3 equal bins? The example says the following:
"We would like four bins of equal size bandwidth, the forth is because the function "cut" include the rightmost edge."
What does this mean?

Could anyone help me understand this?
 
  • #2
The bin separators include the minimum and maximum, plus two more equally spaced separators -- four numbers. These separators give you three bins: 48.0 to 101.5, 101.5 to 155, 155 to 208.5.

This is the classic fencepost problem. If you put 10 fenceposts along a property line, there will be 9 sections defined by the fenceposts.
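
The fencepost count can be checked with a minimal sketch. The property line length (0 to 90) is a made-up value for illustration only; the point is that n + 1 edges always delimit n sections.

```python
import numpy as np

# Hypothetical property line from 0 to 90 with 10 equally spaced fenceposts.
posts = np.linspace(0.0, 90.0, 10)

# Each section is the gap between two neighbouring posts.
sections = list(zip(posts[:-1], posts[1:]))

print(len(posts), "posts define", len(sections), "sections")
```

The same counting applies to pd.cut: three labels need four edges.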
 
  • #3
Mark44 said:
The bin separators include the minimum and maximum, plus two more equally spaced separators -- four numbers. These separators give you three bins: 48.0 to 101.5, 101.5 to 155, 155 to 208.5.

This is the classic fencepost problem. If you put 10 fenceposts along a property line, there will be 9 sections defined by the fenceposts.

Yes, right, but these three bins don't reach the maximum value. Suppose I have a set of numbers in the range [X_min, X_max]. Wouldn't we divide the range by 3, as in (X_max - X_min)/3, to get three equally spaced bins? For example, if X_min = 1 and X_max = 10, then binwidth = (10 - 1)/3 = 3. So we will have one bin from 1 to 4, one from 4 to 7, and one from 7 to 10, which are three bins of equal width. In the example I first posted, the maximum value is not included in the bins' boundaries, and the range was divided by 4, not 3. Why? It is said that this has to do with the cut method of the pandas library, but I am not sure how!
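
The arithmetic in this small example (X_min = 1, X_max = 10, three bins) can be reproduced with a short sketch. The variable names are chosen here for illustration and are not from the original example.

```python
import numpy as np

x_min, x_max, n_bins = 1.0, 10.0, 3

# Width of each bin when the range is split into n_bins equal parts.
binwidth = (x_max - x_min) / n_bins

# n_bins + 1 edges are needed to delimit n_bins intervals.
edges = x_min + binwidth * np.arange(n_bins + 1)

print(binwidth)  # 3.0
print(edges)     # edges at 1, 4, 7 and 10
```

Dividing by 3 gives the right width; what is still needed is a fourth edge at X_max itself.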
 
  • #4
You are right. If I type this in Python:

Code:
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

binwidth = (horsepower.max() - horsepower.min()) / 4

# np.arange excludes the stop value, so the last edge is 208.5, not 262.0
bins = np.arange(horsepower.min(), horsepower.max(), binwidth)

print(pd.cut(horsepower, bins, labels=group_names, include_lowest=True))
you get [low, medium, NaN, medium, NaN, high]. Anything over 208.5 falls outside the range and produces NaN.
If, however, you divide by 3, you get: ValueError: Bin labels must be one fewer than the number of bin edges
cut() needs an array of 4 numbers to define the borders of 3 intervals.
The problem is the arange function, which does not include the maximum of the range. According to the documentation at scipy.org:
"When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases."

Using linspace, where you choose how many points you need, is easier.

Code:
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# Four edges from min to max (both included) define three bins
bins = np.linspace(horsepower.min(), horsepower.max(), 4)

print(pd.cut(horsepower, bins, labels=group_names, include_lowest=True))
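
To see the two approaches side by side, here is a small sketch that reuses the six sample values above and compares the edges and the resulting categories. The variable names (arange_edges, linspace_edges, and so on) are picked for this illustration only.

```python
import numpy as np
import pandas as pd

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])
labels = ['low', 'medium', 'high']
lo, hi = horsepower.min(), horsepower.max()

# arange stops before the maximum: the edges end at 208.5
arange_edges = np.arange(lo, hi, (hi - lo) / 4)

# linspace includes both endpoints: the edges end at 262.0
linspace_edges = np.linspace(lo, hi, 4)

by_arange = pd.cut(horsepower, arange_edges, labels=labels, include_lowest=True)
by_linspace = pd.cut(horsepower, linspace_edges, labels=labels, include_lowest=True)

print(arange_edges[-1], list(by_arange))      # 223 and 262 come out as NaN
print(linspace_edges[-1], list(by_linspace))  # every value gets a label
```

Note that the two methods also use different bin widths (53.5 versus about 71.3), so even values below 208.5 can land in different categories.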

 
  • #5
Thanks. Now I know what was meant by

"..., the forth is because the function "cut" include the rightmost edge."

but the example didn't handle it correctly, it seems. Your suggestion works perfectly and makes perfect sense. I counted each category using the first method as follows:

Python:
df['horsepower_cat'].value_counts()

and the output was:

Python:
low       115
medium     62
high       23

Comparing these numbers with the histogram figure shows that they aren't correct.

testplot.png

With the linspace method, the numbers are:

Python:
low       153
medium     43
high        5

which are in alignment with the histogram.
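
One way to spot values that fell outside the bins is to count the NaN entries as well. The sketch below is an illustration on the six sample values from post #4, not on the original data set, and uses the first (arange) method; value_counts skips NaN unless dropna=False is passed.

```python
import numpy as np
import pandas as pd

# Six sample values only; the real data set is not reproduced here.
df = pd.DataFrame({'horsepower': [48.0, 117.0, 262.0, 139.0, 223.0, 173.0]})

lo, hi = df['horsepower'].min(), df['horsepower'].max()
edges = np.arange(lo, hi, (hi - lo) / 4)  # last edge is 208.5

df['horsepower_cat'] = pd.cut(df['horsepower'], edges,
                              labels=['low', 'medium', 'high'],
                              include_lowest=True)

# dropna=False adds a row for the values that did not fit into any bin.
counts = df['horsepower_cat'].value_counts(dropna=False)
n_missing = df['horsepower_cat'].isna().sum()

print(counts)
print("values outside the bins:", n_missing)
```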

I think the example I am reading contains some mistakes. After trying to bin the data as above, the author tries to draw the histogram as follows:

Python:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(df['horsepower'], bins = 3)

However, df['horsepower'] is a pandas Series (it carries an index), and its data type in the original data is object. So the above code gave an error, and I had to modify it as follows:

Python:
%matplotlib inline
# (What does the line above do, by the way?)
import matplotlib.pyplot as plt

horsepower = df['horsepower'].tolist()  # convert the column (a Series) to a list
plt.hist([float(t) for t in horsepower], bins=3)

for it to work.
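
A possible alternative to the list comprehension, assuming the object dtype comes from numbers stored as strings (the original data is not shown, so this is a guess), is to convert the column with pd.to_numeric. The sample strings below are hypothetical.

```python
import pandas as pd

# Hypothetical column: numbers stored as strings, so the dtype is object.
df = pd.DataFrame({'horsepower': pd.Series(
    ['48.0', '117.0', '262.0', '139.0', '223.0', '173.0'], dtype=object)})

# errors='coerce' turns anything unparsable into NaN instead of raising.
hp = pd.to_numeric(df['horsepower'], errors='coerce')

print(df['horsepower'].dtype, '->', hp.dtype)
print(hp.ndim)  # a Series is one-dimensional; the index is not an extra axis
```

The converted Series hp can then be passed to plt.hist(hp.dropna(), bins=3) directly.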

Thanks again.
 

  • #6
S_David said:
Yes, right, but these three bins don't reach to the maximum value.
My mistake. I was thinking that 208.5 was the max value, even though your code showed a larger value.
 

1. What is binning data in Python?

Binning data in Python refers to the process of dividing a continuous variable into smaller groups, or "bins", in order to simplify the data and make it more manageable for analysis. This is commonly used in data analysis and visualization to better understand patterns and trends within the data.

2. How is binning data different from grouping data?

While both binning and grouping divide data into smaller subsets, binning is typically used with continuous data while grouping is used with categorical data. Binning creates a set number of bins based on the range of values, while grouping collects data by pre-defined categories or criteria.

3. What are the benefits of binning data in Python?

There are several benefits of binning data in Python, including simplifying complex data, identifying patterns and trends, and making data more manageable for analysis. In some cases binning can also make statistical models less sensitive to outliers, at the cost of discarding detail within each bin.

4. How do I perform binning in Python?

In Python, binning can be performed using libraries such as NumPy, pandas, and Matplotlib. The process involves choosing a number of bins, or specifying the bin edges, and then mapping each value to the appropriate bin. This can be done using functions such as pd.cut() or np.histogram().
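
As a minimal sketch of both functions, using the six sample horsepower values from the thread: np.histogram counts how many values fall in each bin, and pd.cut assigns a label to each value. Reusing the histogram's edges for pd.cut is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# np.histogram with bins=3 returns the counts and the 4 equally spaced edges.
counts, edges = np.histogram(horsepower, bins=3)

# pd.cut maps each value to a label, here using the same edges.
cats = pd.cut(horsepower, bins=edges, labels=['low', 'medium', 'high'],
              include_lowest=True)

print(counts)      # number of values per bin
print(edges)       # min, two interior edges, max
print(list(cats))  # one label per value
```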

5. Can I customize the bins in binning data in Python?

Yes, you can customize the bins in binning data in Python. You can specify the number of bins, the bin size, and the range of values to be included in each bin. You can also choose the labels for the bins and customize the visualization of the binned data using different colors and styles.
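
For example, pd.cut accepts an explicit list of edges and a list of labels. The edges below (0, 100, 200, 300) are hand-picked for illustration and do not come from the thread; right=False makes each bin include its left edge and exclude its right edge.

```python
import pandas as pd

horsepower = pd.Series([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# Hypothetical hand-picked edges: [0, 100), [100, 200), [200, 300)
edges = [0, 100, 200, 300]

cats = pd.cut(horsepower, bins=edges, labels=['low', 'medium', 'high'],
              right=False)

print(cats.tolist())
```

Passing an integer instead, as in pd.cut(horsepower, bins=3), lets pandas compute equal-width edges from the data.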
