Understand Binning Data in Python

  • Context: Python
  • Thread starter: EngWiPy
  • Tags: Data, Python

Discussion Overview

The discussion revolves around the process of binning continuous data into categorical data using Python, specifically focusing on the 'horsepower' variable in a dataframe. Participants explore the rationale behind the choice of binning methods, the implications of using different functions for creating bins, and the resulting categorizations.

Discussion Character

  • Technical explanation
  • Conceptual clarification
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants question why the bin width is calculated by dividing the range by 4 instead of 3, given that they want three bins.
  • There is mention of the "fencepost problem," where the number of bins is one less than the number of bin edges, leading to confusion about the maximum value being included.
  • One participant suggests that using the `linspace` function may be more appropriate than `arange` for creating bins, as it allows for better control over the number of points.
  • Another participant points out that using `cut` requires the number of bin labels to be one fewer than the number of bin edges, which is why four edges are needed for three bins.
  • There is a discussion about discrepancies in the counts of categories when comparing results from different binning methods and how they align with histogram outputs.
  • One participant notes an error in the original example regarding the handling of the dataframe and its conversion to a list for histogram plotting.

Areas of Agreement / Disagreement

Participants express differing views on the appropriate method for binning and the implications of using different functions. There is no consensus on the correctness of the original example, and multiple competing views remain regarding the best practices for binning data in Python.

Contextual Notes

Participants highlight limitations in the original example, including potential mistakes in the binning process and issues with data type handling when plotting histograms.

EngWiPy
Hello,

I was reading an example on binning data, where a continuous variable is transformed into a categorical variable. The dataframe name is df, and the continuous variable's column's name is 'horsepower'. We would like to transform the continuous variable feature into a categorical feature with three values: low, medium, and high, and put the result in a new feature called 'horsepower_cat'. The lines of code for this to be done are:

Python:
binwidth = (df['horsepower'].max() - df['horsepower'].min())/4 #why 4 not 3??

bins = np.arange(df['horsepower'].min(), df['horsepower'].max(), binwidth) #np is from import numpy as np

group_names = ['low', 'medium', 'high']

df['horsepower_cat'] = pd.cut(df['horsepower'], bins, labels = group_names, include_lowest = True) #pd is from import pandas as pd

The values are:

Python:
df['horsepower'].min():  48.0
df['horsepower'].max(): 262.0

binwidth: 53.5

bins: array([48.0,  101.5,  155. ,  208.5])

The variable bins does not cover the whole range: we get three bins that stop at 208.5! Why? And why did we divide the range by 4 in the first line, although we want 3 equal bins? The example says the following:
"We would like four bins of equal size bandwidth, the forth is because the function "cut" include the rightmost edge."
What does this mean?

Could anyone help me understand this?
 
S_David said:
Why did we divide the range by 4 in the first line, although we want 3 equal bins?
The bin separators include the minimum and maximum, plus two more equally spaced separators -- four numbers. These separators give you three bins: 48.0 to 101.5, 101.5 to 155, 155 to 208.5.

This is the classic fencepost problem. If you put 10 fenceposts along a property line, there will be 9 sections defined by the fenceposts.
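
The fencepost count is easy to check numerically. A minimal sketch, assuming a hypothetical 90-unit property line with a post at each end (the names `posts` and `sections` are illustrative only):

```python
import numpy as np

n_posts = 10
# Hypothetical property line of length 90 with a post at each end.
posts = np.linspace(0.0, 90.0, n_posts)

# Each section runs from one post to the next one.
sections = list(zip(posts[:-1], posts[1:]))

print(len(posts), len(sections))  # number of posts, number of sections
```

The same counting applies to bins: n bins need n + 1 edges.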
 
Mark44 said:
These separators give you three bins: 48.0 to 101.5, 101.5 to 155, 155 to 208.5.

Yes, right, but these three bins don't reach the maximum value. Suppose I have a set of numbers in [X_min, X_max]. Wouldn't we divide the range by 3, as in (X_max - X_min)/3, to get three equally spaced bins? For example, if X_min = 1 and X_max = 10, the binwidth = (10 - 1)/3 = 3. So we will have one bin from 1 to 4, one from 4 to 7, and one from 7 to 10, which are three bins of equal width. In the example I first posted, the maximum value is not included in the bins' boundaries, and the range was divided by 4, not 3. Why? It is said that it has to do with the cut method of the pandas library, but I am not sure how!
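
The arithmetic in this small example can be reproduced as follows. This is only a sketch of the 1-to-10 case described above; the variable names are made up for illustration:

```python
import numpy as np

x_min, x_max = 1, 10
n_bins = 3

# Width of each bin when the range is split into n_bins equal parts.
bin_width = (x_max - x_min) / n_bins

# n_bins bins need n_bins + 1 edges; linspace includes both endpoints.
edges = np.linspace(x_min, x_max, n_bins + 1)

print(bin_width)  # 3.0
print(edges)      # edges at 1, 4, 7 and 10
```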
 
You are right, if I type this in Python

Code:
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# Dividing the range by 4 gives a bin width of 53.5
binwidth = (horsepower.max() - horsepower.min())/4

# arange excludes the stop value, so the edges end at 208.5, not 262.0
bins = np.arange(horsepower.min(), horsepower.max(), binwidth)

print(pd.cut(horsepower, bins, labels = group_names, include_lowest = True))
you get [low, medium, NaN, medium, NaN, high]. Anything over 208.5 falls outside the range and produces NaN.
If you divide by 3 instead, you get: ValueError: Bin labels must be one fewer than the number of bin edges
cut() needs an array of 4 numbers to define the borders of 3 intervals.
The problem is in the arange function, which does not include the maximum of the range. According to the documentation at scipy.org:
"When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases."

Using linspace, where you can choose yourself how many points you need, is easier.

Code:
import numpy as np
import pandas as pd

group_names = ['low', 'medium', 'high']

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])

# 4 edges from min to max (both included) define 3 equal-width bins
bins = np.linspace(horsepower.min(), horsepower.max(), 4)

print(pd.cut(horsepower, bins, labels = group_names, include_lowest = True))
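
To see the two approaches side by side, here is a short sketch on the same six sample values. The names `edges_arange`, `edges_linspace`, `cat_arange` and `cat_linspace` are only illustrative:

```python
import numpy as np
import pandas as pd

horsepower = np.array([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])
labels = ['low', 'medium', 'high']
lo, hi = horsepower.min(), horsepower.max()

# arange stops before `hi`, so the top of the range is left uncovered.
edges_arange = np.arange(lo, hi, (hi - lo) / 4)
# linspace includes both endpoints: 4 edges, 3 bins of width (hi - lo) / 3.
edges_linspace = np.linspace(lo, hi, 4)

cat_arange = pd.cut(horsepower, edges_arange, labels=labels, include_lowest=True)
cat_linspace = pd.cut(horsepower, edges_linspace, labels=labels, include_lowest=True)

print(edges_arange)    # last edge is 208.5
print(edges_linspace)  # last edge is 262.0
print(list(cat_arange))    # 262.0 and 223.0 become NaN
print(list(cat_linspace))  # every value gets a label
```

pd.cut also accepts an integer for the bins argument, in which case it works out equal-width edges from the data itself, so that may be worth trying as well.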

 
willem2 said:
Using linspace where you can choose yourself how many points you need is easier.


Thanks. Now I know what the example meant by

"..., the forth is because the function "cut" include the rightmost edge."

but the example didn't handle it correctly, it seems. Your suggestion works perfectly and makes perfect sense. I counted each category using the first method as follows:

Python:
df['horsepower_cat'].value_counts()

and the output was:

Python:
low       115
medium     62
high       23

Comparing these numbers with the histogram figure shows that they aren't correct.

testplot.png

With the linspace method, the numbers are:

Python:
low       153
medium     43
high        5

which are in alignment with the histogram.
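
A check like this can also be done without reading numbers off the plot, by comparing the category counts with np.histogram. The sketch below uses the six sample values from earlier in the thread, not the real dataframe, so the counts are not the ones listed above:

```python
import numpy as np
import pandas as pd

# Sample values from earlier in the thread, not the full dataset.
horsepower = pd.Series([48.0, 117.0, 262.0, 139.0, 223.0, 173.0])
labels = ['low', 'medium', 'high']

edges = np.linspace(horsepower.min(), horsepower.max(), 4)
cats = pd.cut(horsepower, edges, labels=labels, include_lowest=True)

# Look the counts up by label so the order is low, medium, high.
counts = cats.value_counts()
cut_counts = [int(counts[name]) for name in labels]

# np.histogram with bins=3 builds equal-width bins over [min, max].
hist_counts, hist_edges = np.histogram(horsepower, bins=3)

print(cut_counts)
print(hist_counts.tolist())
```

One caveat: pd.cut intervals are closed on the right by default, while np.histogram bins are closed on the left (except the last one), so a value sitting exactly on an interior edge can be counted in different bins by the two.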

I think the example I am reading contains some mistakes. After trying to bin the data as above, the author tries to draw the histogram as follows:

Python:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(df['horsepower'], bins = 3)

However, df['horsepower'] is a pandas Series rather than a plain array (it carries an index), and its data type in the original data is object. So the above code gave an error, and I had to modify it as:

Python:
# what does this do by the way?
%matplotlib inline
import matplotlib.pyplot as plt

horsepower = df['horsepower'].tolist() # convert the column (a Series) to a list
plt.hist([float(t) for t in horsepower], bins = 3)

for it to work.
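
An alternative that avoids the manual loop is to convert the column to numbers once with pd.to_numeric. This is only a sketch: the small dataframe below is hypothetical, and the '?' entry is an assumed placeholder for a missing value, not something taken from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical column read in as strings, with '?' marking a missing value.
df = pd.DataFrame({'horsepower': ['48', '117', '?', '262', '139']})

# Coerce to float; anything unparseable becomes NaN and is then dropped.
hp = pd.to_numeric(df['horsepower'], errors='coerce').dropna()

# plt.hist(hp, bins=3) would accept this numeric Series directly;
# np.histogram is used here only to keep the sketch free of plotting.
counts, edges = np.histogram(hp, bins=3)

print(hp.tolist())
print(counts.tolist())
```

A Series is already one-dimensional, so the index is probably not what breaks plt.hist; the string (object) values are the more likely cause. As for %matplotlib inline, it is an IPython/Jupyter magic command rather than Python syntax, and it asks the notebook to draw figures directly below the cell.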

Thanks again.
 

S_David said:
Yes, right, but these three bins don't reach to the maximum value.
My mistake. I was thinking that 208.5 was the max value, even though your code showed a larger value.
 