Categorical vs Numerical Variables

In summary, in ordinal categorical variables such as "star rating" (1-5), the numeric value assigned to one category is not necessarily related to the numeric values assigned to the other categories.
  • #1
fog37
TL;DR Summary
Clearly understand the difference between categorical and numerical variables...
Hello,

I am generally clear on the distinction between numerical and nonnumerical (also called categorical or qualitative) variables but I still have some doubts in some regards.

A numerical variable (continuous or discrete) has a value that derives from a measurement procedure (using a tool) or from counting, I would say.
Obviously, height and weight are continuous numerical variables (we use tools to get their values). The number of passengers on a plane is a discrete numerical variable even if we don't use a tool to determine that number. What we can do with numerical variables is math (calculate the mean, the mode, the median).

With regard to categorical variables, they are variables with a finite number of labels: each observation belongs to one of a finite number of groups (2 or more). The labels are generally text, but they can also be numbers (zip code, etc.) which don't really have a mathematical meaning, I would say.

Ordinal (nominal too?) categorical variables appear to be similar to discrete numerical variables because they have a finite number of values. What is the main difference then? What is the criterion to determine that? For categorical variables, we can calculate the frequency of a certain label/class and the only measure of central tendency is the mode (we cannot compute the mean and median for a nominal or ordinal qualitative variable)...

Here is my dilemma: star rating (1 star to 5 stars) is generally considered an ordinal categorical variable. However, we could take all the ratings provided by 4 customers (4,3,4,1) and compute the average star rating (4+3+4+1)/4 = 3.

So can a variable be categorical and also numerical at the same time? I wouldn't think so. "Star rating" is categorical. But "average star rating" is a different variable and is numerical...Is that correct?
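For anyone who wants to see the numbers, here is a minimal sketch using only Python's standard library. It computes the frequency table, the mode, a median, and the mean for the four ratings in the post. The choice of `median_low` is an assumption on my part: it returns an actual rating from the data rather than averaging the two middle codes, so it only relies on the ordering. Whether the mean is meaningful is exactly the question of the thread; the code only shows that it can be computed.

```python
from collections import Counter
from statistics import mean, median_low, mode

ratings = [4, 3, 4, 1]  # the four customer ratings from the post

frequencies = Counter(ratings)       # counts per star level: valid for any categorical data
most_common = mode(ratings)          # 4 appears twice
low_median = median_low(ratings)     # sorted: 1, 3, 4, 4 -> lower middle value is 3
average = mean(ratings)              # (4 + 3 + 4 + 1) / 4 = 3, treats stars as numbers

print(frequencies)
print(most_common, low_median, average)
```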

Thanks for any clarification.
 
  • #2
You can distinguish a set of categories that represent different levels of some ordered variable from a set of categories that should not be ordered. For instance, the set of weight categories {underweight, normal, overweight, obese} is different from the set of animal types {cat, dog, horse, penguin}.
You will need to use good judgment about whether you should give numerical values to the ordered set. Assigning values of 1,2,3,4 to {underweight, normal, overweight, obese} might be used in many statistical algorithms as implying that the value assignment represents a linear relationship.
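To make the distinction concrete, here is a small sketch in plain Python. The two samples are made up for illustration, and `rank` is just a hypothetical helper mapping each weight label to its position in the ordering. The ordered set supports sorting and picking a middle observation; the unordered set only supports counting.

```python
from collections import Counter

# Ordered categories: the list position encodes the order and nothing else.
WEIGHT_LEVELS = ["underweight", "normal", "overweight", "obese"]
rank = {label: i for i, label in enumerate(WEIGHT_LEVELS)}

# Made-up sample of five observations (odd length, so there is one middle item).
observed = ["obese", "normal", "underweight", "overweight", "normal"]
ranked = sorted(observed, key=rank.__getitem__)
ordinal_median = ranked[len(ranked) // 2]  # uses the order only, no arithmetic on codes

# Unordered categories: counting is all we can do.
animals = ["cat", "dog", "cat", "penguin", "horse"]  # made-up sample
animal_mode = Counter(animals).most_common(1)[0][0]

print(ranked)
print(ordinal_median, animal_mode)
```

Note that nothing here adds or averages the ranks 0 to 3, so no linear spacing between the weight categories is implied.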
 
  • #3
FactChecker makes a very good point that we can apply directly to your scenario:

fog37 said:
we could take all the ratings provided by 4 customers (4,3,4,1) and compute the average star rating (4+3+4+1)/4 = 3.
Who is to say that "four stars" is 4 times as valuable as "one star"? You're assuming a linear relationship. That is unwarranted. It could be logarithmic or geometric - heck, it could be arbitrary. Here's how I might break down stars:

One star: I dislike this film: -1
Two stars: this film is average: 0
Three stars: I like this film: 4
Four stars: This is a classic: 11

What does that do to your average now?

In fact, I would hazard that the reason for using stars instead of numbers is to highlight that there's no numeric (and certainly no linear) connection between the ratings.
 
  • #4
DaveC426913 said:
In fact, I would hazard that the reason for using stars instead of numbers is to highlight that there's no numeric (and certainly no linear) connection between the ratings.

On the other hand, if the thing I care about is how much fuel it takes to fly from New York to Albuquerque, the fuel is not proportional to the number of passengers: 4 passengers isn't 4 times as bad as 1 passenger (e.g. most of the fuel cost is the constant cost of flying the plane). But you would be crazy to argue this means you can never compute an expected value of the number of people on a plane.
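A rough sketch of that point, with every number invented for illustration (`FIXED_FUEL`, `FUEL_PER_PASSENGER`, and the passenger counts are all hypothetical, as is the assumption that fuel is a fixed cost plus a small per-passenger cost):

```python
from statistics import mean

FIXED_FUEL = 10_000.0       # hypothetical fuel needed just to fly the plane
FUEL_PER_PASSENGER = 25.0   # hypothetical extra fuel per passenger

def fuel(passengers):
    """Fuel under the assumed fixed-plus-per-passenger model."""
    return FIXED_FUEL + FUEL_PER_PASSENGER * passengers

ratio = fuel(4) / fuel(1)             # about 1.0075, nowhere near 4
counts = [120, 150, 90, 140]          # made-up passenger counts on four flights
expected_passengers = mean(counts)    # 125, still a perfectly sensible quantity

print(ratio, expected_passengers, fuel(expected_passengers))
```

The passenger count is still a number you can average, even though the quantity you ultimately care about does not scale in proportion to it.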

It's good to wonder if statistical computations are giving meaningful results, but someone writing "ordinal" vs. "numeric" next to some data doesn't mean they have answered the question for you.
 
  • #5
DaveC426913 said:
One star: I dislike this film: -1
Two stars: this film is average: 0
Three stars: I like this film: 4
Four stars: This is a classic: 11

What does that do to your average now?
It gives a mean for { 4, 3, 4, 1 } of (11 + 4 + 11 - 1) / 4 = 6.25, which is somewhere between 3 and 4 stars (and this can be illustrated using an interpolation).
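As a quick check of that arithmetic, here is a short sketch that recodes the ratings with the scale quoted above and maps the resulting mean back onto the star axis. The function `to_stars` is a hypothetical helper, and piecewise-linear interpolation between neighbouring star levels is my assumption about what "an interpolation" means here.

```python
from statistics import mean

ratings = [4, 3, 4, 1]
linear_scale = {1: 1, 2: 2, 3: 3, 4: 4}     # stars taken at face value
quoted_scale = {1: -1, 2: 0, 3: 4, 4: 11}   # the values proposed in the quote

linear_mean = mean(linear_scale[r] for r in ratings)    # 12 / 4 = 3
recoded_mean = mean(quoted_scale[r] for r in ratings)   # (11 + 4 + 11 - 1) / 4 = 6.25

def to_stars(value, scale):
    """Map a value on the recoded scale back to stars by linear interpolation."""
    levels = sorted(scale)
    for lo, hi in zip(levels, levels[1:]):
        if scale[lo] <= value <= scale[hi]:
            fraction = (value - scale[lo]) / (scale[hi] - scale[lo])
            return lo + fraction * (hi - lo)
    raise ValueError("value lies outside the scale")

stars = to_stars(recoded_mean, quoted_scale)  # 3 + 2.25 / 7, roughly 3.32 stars

print(linear_mean, recoded_mean, stars)
```

So the same four ratings average to 3 stars under the face-value coding and to roughly 3.32 stars under the quoted coding: the mean depends on the numbers you choose to attach to the categories.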

The mistake here lies in believing that there are only two sorts of data. We need to consider 3 sorts:
  • If all we have is labels (e.g. red, blue, yellow) then the data is categorical and we can't do much with it.
  • If there is an ordering (e.g. underweight, normal, overweight, obese) then we can rank the data [Edit: which means that we can usually find the median] but we still can't calculate the mean unless...
  • If we can assign numerical values (either continuous, e.g. weight, or discrete, e.g. 1 star = -1, 2 stars = 0 etc. - strictly speaking we need a measure) then we can calculate any statistic we want.
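The three cases above can be sketched as a small function. This is only an illustration: `summarize` and the level names `"nominal"`, `"ordinal"`, and `"numeric"` are hypothetical, and I use the lower median so that the ordinal case never has to average two labels.

```python
from statistics import mean, median_low, mode

def summarize(values, level, order=None):
    """Return only the statistics that make sense for the given level.

    level: 'nominal' (labels only), 'ordinal' (labels plus an ordering,
    given as the list `order`), or 'numeric' (real numerical values).
    """
    summary = {"mode": mode(values)}                 # always available
    if level == "ordinal":
        ranked = sorted(values, key=order.index)     # uses the ordering only
        summary["median"] = ranked[(len(ranked) - 1) // 2]  # lower median
    elif level == "numeric":
        summary["median"] = median_low(values)
        summary["mean"] = mean(values)
    return summary

WEIGHT_LEVELS = ["underweight", "normal", "overweight", "obese"]

colours = summarize(["red", "blue", "red"], "nominal")
weights = summarize(["normal", "obese", "normal"], "ordinal", order=WEIGHT_LEVELS)
recoded = summarize([11, 4, 11, -1], "numeric")   # the star ratings after recoding

print(colours)
print(weights)
print(recoded)
```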
 
  • #6
I may be misreading here but this:
Ordinal (nominal too?) categorical variables appear to be similar to discrete numerical variables because they have a finite number of values.
contains a mistake: discrete numerical variables are not limited to a finite number of values. A very simple counterexample: X = number of flips of a coin required to see the first head. The values are 1, 2, 3, 4, 5, ...
Although it is exceedingly unlikely [words fail me in an attempt to stress just how unlikely] to observe X = 1000, it is theoretically possible: in short, there is no largest value of X imposed by this experiment.
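To put a number on "exceedingly unlikely", here is a minimal sketch assuming a fair coin and independent flips (so X is geometric with p = 1/2). The function names and the seed are my own choices for illustration.

```python
import random
from fractions import Fraction

def p_first_head_on(k, p=Fraction(1, 2)):
    """P(X = k): k - 1 tails followed by one head."""
    return (1 - p) ** (k - 1) * p

def flips_until_first_head(rng):
    """Simulate one run of the experiment with a fair coin."""
    flips = 1
    while rng.random() >= 0.5:   # treat values >= 0.5 as tails
        flips += 1
    return flips

tiny = p_first_head_on(1000)     # exactly 1 / 2**1000: positive, but astronomically small

rng = random.Random(0)           # fixed seed so the run is repeatable
samples = [flips_until_first_head(rng) for _ in range(10_000)]

print(float(p_first_head_on(1)), float(p_first_head_on(10)))
print(tiny > 0, max(samples))
```

Every positive integer has nonzero probability, so the set of possible values is countably infinite even though any simulated sample will only ever show small ones.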

So can a variable be categorical and also numerical at the same time?
No -- it is important to stress that. People often treat categorical variables as numerical [this occurs most often with ratings: 1 through 5, for example], but those numbers are simply codes for opinions. Treating them as numerical for purposes of averages or other sample statistics is always wrong.
 

1. What is the difference between categorical and numerical variables?

Categorical variables are qualitative variables that represent categories or groups. They have no inherent numerical value and can only be described by their label. On the other hand, numerical variables are quantitative variables that represent numerical values and can be measured and compared mathematically.

2. How are categorical variables represented in data?

Categorical variables are typically represented by labels or names; "gender" and "type of car" are examples of such variables. They can also be represented by numbers, but these numbers have no mathematical significance and are simply used to differentiate between categories.

3. What types of statistical analyses are appropriate for categorical and numerical variables?

Categorical variables are often summarized with frequency tables and bar charts, while numerical variables can also be summarized with descriptive statistics such as the mean and standard deviation, and analyzed with inferential methods such as t-tests.

4. Can a variable be both categorical and numerical?

No, a variable can only be one or the other. For example, "age" can be represented as a numerical variable, while "gender" can be represented as a categorical variable.

5. How can I determine if a variable is categorical or numerical?

The easiest way to determine if a variable is categorical or numerical is to ask yourself if it represents a category or a numerical value. If it represents a category, it is categorical. If it represents a numerical value, it is numerical.
