Do outliers exist in categorical data and how can they be detected?

  • Context: Graduate 
  • Thread starter: cynnetje
  • Tags: Chi square, Data

Discussion Overview

The discussion revolves around the existence and detection of outliers in categorical data, particularly in the context of performing a chi-square test for independence. Participants explore whether outliers can be identified in categorical variables and the appropriate methods for doing so.

Discussion Character

  • Debate/contested
  • Technical explanation

Main Points Raised

  • One participant questions the possibility of outliers in categorical data, suggesting that variation is limited and that outliers might only arise from erroneous data entries.
  • Another participant asserts that boxplots are not suitable for categorical data and that the concept of outliers may not apply.
  • It is mentioned that an expected count of at least 5 in each cell of a chi-square test is crucial for valid results.
  • A suggestion is made to use the Fisher Exact Test as an alternative when expected counts are low, along with conditions for using the chi-square approximation.
  • One participant notes that outliers can exist in categorical data when multiple variables are involved, providing an example of a case with binary variables.
  • There is a reference to specialized techniques for categorical analysis that may be covered in academic courses.

Areas of Agreement / Disagreement

Participants express differing views on the existence and detection of outliers in categorical data, with no consensus reached on the applicability of outlier concepts in this context.

Contextual Notes

Some assumptions regarding the definitions of outliers and the nature of categorical data remain unresolved, particularly in relation to the statistical methods discussed.

cynnetje
Hello!

I am working on a pre-analysis plan and have to specify what I am going to do with outliers. I have two categorical variables (5 levels and 2 levels) and I will be performing a chi-square test for independence.

I thought of using a boxplot to detect outliers, but now I am not sure if it is even possible to have outliers in categorical data. The range is so small that a lot of variation in the data won't be possible. The only outliers I could think of are wrong data (values that fall outside the possible range due to mistakes). I have looked online and in my statistics books, but was unable to find a solution, so I really hope someone here can help me out.

To summarize, is it possible to have outliers in categorical data and if yes, how do I detect them?

Thank you so much for your time and have a nice day!
 
You can't do a box and whiskers plot for categorical data. The idea of outliers doesn't make a lot of sense for that kind of data.

However, you do need to make sure that each cell in your chi-square test has an expected count of at least 5.
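To make the expected-count check concrete, here is a minimal sketch in plain Python. It assumes the usual independence formula (row total times column total, divided by the grand total); the 5 x 2 table and the function name `expected_counts` are made up for illustration and are not from the original poster's data.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 5 x 2 table of observed counts (made-up numbers).
observed = [
    [12, 8],
    [15, 5],
    [9, 11],
    [3, 2],
    [20, 15],
]

expected = expected_counts(observed)

# Collect (row, column, expected count) for every cell below 5.
small_cells = [
    (i, j, round(e, 2))
    for i, row in enumerate(expected)
    for j, e in enumerate(row)
    if e < 5
]
print(small_cells)
```

In this made-up table only the fourth row (5 observations in total) produces expected counts below 5, so that is the level one would look at first, for example by merging it with a neighbouring level if that makes sense for the study.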
 
Thank you for your answer, very helpful! We will be checking the assumptions, thank you for mentioning it:)
 
You are welcome. Let us know if you have any follow up questions.
 
Just to add to Dale's response: if you find cells with an expected count below 5, an alternative test you can use is the Fisher Exact Test. Also, if at least 80% of the cells have an expected count above 5 and all cells have an expected count above 1, then the chi-square distribution can still be a good approximation for the p-value. The all-cells-above-5 rule and the Fisher Exact Test are both conservative.
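The 80% / above-1 rule can be written as a small check on a table of expected counts. This is only a sketch: the function name `relaxed_rule_ok` and both example tables are hypothetical, and treating the thresholds as "at least" rather than "strictly above" is an assumption.

```python
def relaxed_rule_ok(expected):
    """True if at least 80% of expected counts are >= 5 and none is below 1.

    Assumption: thresholds are inclusive (>= 5 and >= 1).
    """
    cells = [e for row in expected for e in row]
    n_large = sum(1 for e in cells if e >= 5)
    return n_large / len(cells) >= 0.8 and min(cells) >= 1


# Hypothetical expected counts for a 5 x 2 table: 8 of 10 cells are >= 5.
mostly_large = [
    [11.8, 8.2],
    [11.8, 8.2],
    [11.8, 8.2],
    [2.95, 2.05],
    [20.65, 14.35],
]

# Hypothetical 2 x 2 table with one expected count below 1.
one_tiny_cell = [
    [0.6, 9.4],
    [7.0, 13.0],
]

print(relaxed_rule_ok(mostly_large))   # True
print(relaxed_rule_ok(one_tiny_cell))  # False
```

When the check fails, the exact test mentioned above is the fallback; whether a given statistics package offers it for tables larger than 2 x 2 should be confirmed in that package's documentation.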
 
Hey cynnetje.

Have you ever done any sort of categorical analysis statistics at your university?

There are specialized techniques for categorical data, and these are taught in either an A-level statistics course [undergraduate] or a specialized course on categorical analysis [in graduate school].
 
In general you can have some sort of outliers with categorical data, but only if you have multiple variables. As an example, take 10 binary variables where all but one of the test persons have "1" in 0 to 2 of the variables, while the one remaining test person has "1" in all 10 variables. That is clearly an outlier.
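The binary-variable example can be reproduced with a few lines of Python. This is an illustrative sketch with made-up data: the helper `make_row` and the cutoff of 5 are arbitrary assumptions, not an established detection rule.

```python
def make_row(ones_at, n_vars=10):
    """Build one person's answers: 1 at the given positions, 0 elsewhere."""
    return [1 if j in ones_at else 0 for j in range(n_vars)]


# Made-up data: nine people with "1" in 0 to 2 variables, one with "1" in all 10.
people = [
    make_row({0}), make_row({1, 4}), make_row(set()), make_row({2}),
    make_row({3, 7}), make_row({5}), make_row({0, 9}), make_row(set()),
    make_row({6}),
    make_row(set(range(10))),  # the one person with "1" everywhere
]

# Summarise each person by how many variables are "1".
ones_per_person = [sum(row) for row in people]

cutoff = 5  # arbitrary illustrative threshold, chosen by eye
flagged = [i for i, k in enumerate(ones_per_person) if k > cutoff]

print(ones_per_person)  # [1, 2, 0, 1, 2, 1, 2, 0, 1, 10]
print(flagged)          # [9]
```

No single variable shows anything unusual here, since each one only takes the values 0 and 1. The last person stands out only once the ten variables are looked at together.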
 
