Need statistical test for contingency table with very small and big counts

SUMMARY

This discussion focuses on statistical testing for a 2x2 contingency table with highly imbalanced counts, specifically where one cell can exceed 3 billion while another holds only 1-10 counts. The user has attempted Pearson's chi-squared test and the Yates correction but found them inadequate due to the magnitude differences. They are exploring Fisher's Exact and Barnard's Exact tests, although factorial calculations for such large numbers pose a challenge. The goal is to establish a measure of dependence between two binary events, X and Y, with a preference for conservative conclusions regarding their relationship.

PREREQUISITES
  • Understanding of 2x2 contingency tables
  • Familiarity with Pearson's chi-squared test and Yates correction
  • Knowledge of Fisher's Exact and Barnard's Exact tests
  • Basic concepts of frequentist statistics
NEXT STEPS
  • Research the implementation of Fisher's Exact test for large datasets
  • Explore Barnard's Exact test and its computational methods
  • Study the Phi Coefficient and its application in contingency tables
  • Learn about alternative statistical measures for assessing dependence
USEFUL FOR

Statisticians, data analysts, and researchers working with contingency tables, particularly those dealing with imbalanced data distributions and seeking to understand event dependencies.

Sane
I have a 2x2 contingency table, and I want to discover how likely it is that the two events are dependent.

The top-left cell is usually in the range of 1-10. The bottom-right cell can be over 3 billion. The other two cells are always somewhere in the hundreds to millions.

I have tried Pearson's chi-squared test, and the Yates correction for continuity, with varying degrees of success because of the large differences in magnitude between the entries. These also seem to have the null hypothesis reversed for my purposes: I want to use the p-value to definitively 'find' dependence (accepting that I will miss some dependent pairs for which there is little evidence), rather than to find independence and risk misclassifying dependent events.
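For reference, a Yates-corrected chi-squared test on a table of this shape is a one-liner with SciPy; the counts below are placeholders of roughly the right magnitudes, not the real data:

import numpy as np
from scipy.stats import chi2_contingency

# Placeholder counts: A tiny, B and C mid-sized, D enormous.
table = np.array([[5,       40_000],           # A, B
                  [120_000, 3_000_000_000]])   # C, D

# correction=True applies the Yates continuity correction.
chi2, p, dof, expected = chi2_contingency(table, correction=True)
print(chi2, p)
print(expected)  # expected cell frequencies under independence

With these placeholder counts the expected frequency of the A cell comes out below 2, which is exactly the regime in which the chi-squared approximation is known to be unreliable.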

Fisher's Exact and Barnard's Exact tests look promising, but directly calculating the factorial of a number in the billions is just not feasible.
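That said, the factorials never have to be evaluated directly: working with log-factorials via lgamma keeps every intermediate value small. Below is a minimal sketch of a one-sided Fisher-style tail sum in log space; the counts in the example call are placeholders, and the loop as written tests for more co-occurrence than the margins predict (for the opposite tail, step A downward instead):

import math

def log_table_prob(a, b, c, d):
    # Log of the hypergeometric (Fisher) probability of a 2x2 table
    # with fixed margins.  lgamma(n + 1) == log(n!), so a count in the
    # billions never produces an explicit factorial.
    lf = lambda n: math.lgamma(n + 1)
    return (lf(a + b) + lf(c + d) + lf(a + c) + lf(b + d)
            - lf(a + b + c + d) - lf(a) - lf(b) - lf(c) - lf(d))

def fisher_one_sided_p(a, b, c, d):
    # One-sided p-value: sum the probabilities of all tables with the
    # same margins and an A cell at least as large as the observed one.
    log_ps = []
    while b >= 0 and c >= 0:
        log_ps.append(log_table_prob(a, b, c, d))
        a, b, c, d = a + 1, b - 1, c - 1, d + 1  # keep margins, grow A
    m = max(log_ps)                              # log-sum-exp for stability
    return math.exp(m) * sum(math.exp(lp - m) for lp in log_ps)

# Placeholder counts of the shape described above:
print(fisher_one_sided_p(5, 40_000, 120_000, 3_000_000_000))

SciPy's scipy.stats.fisher_exact implements the same test; I have not checked how it behaves with a cell at 3 billion, so the explicit log-space version is the safe fallback.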

---

In case more overview is needed:

I have two binary events (let's say X and Y) for which I track occurrences throughout various "documents". Sometimes they occur together (A), one occurs without the other (B or C), or neither occurs (D). This creates my 2x2 contingency table, with A in the top-left and D in the bottom-right, D being incredibly huge. There are N = A + B + C + D documents.
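To make the bookkeeping concrete, a minimal sketch of the tally (the per-document pairs in the example call are made up; only the A/B/C/D definitions come from the description above):

def tally(docs):
    # Each document is reduced to a pair of booleans: (X occurred, Y occurred).
    A = B = C = D = 0
    for x, y in docs:
        if x and y:
            A += 1   # both X and Y occur
        elif x:
            B += 1   # X without Y
        elif y:
            C += 1   # Y without X
        else:
            D += 1   # neither occurs
    return A, B, C, D

A, B, C, D = tally([(True, True), (True, False), (False, False)])
N = A + B + C + D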

From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.

I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.

Hope this makes sense.
 
Sane said:
From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.

I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.

Hope this makes sense.

You might already realize this, but the type of statistics you are talking about ("frequentist") does not quantify the probability of any idea about the data, such as whether two things are independent or not. The numbers it computes are essentially "the probability of the data given the assumption of some ideas", not "the probability of some ideas given the data". The approach of frequentist statistics is that if the probability of the data given the hypothesis is "small" (which is a subjective judgment), then the procedure is to "reject" the assumed ideas. If the probability of the data is, say, 0.05 given that a certain hypothesis is assumed, this does not imply that the probability that the hypothesis is false is 0.95.

Example: I saw a man dumping a body in a lake. Is he a murderer?
Null hypothesis: The man is a murderer.
Computation: The chances that a murderer will dispose of a body by dumping it in a lake are only 0.02.
Conclusion: We reject the null hypothesis. The man is not a murderer.
 
Thanks for the reply. I was hoping I could set things up so that rejecting the null hypothesis of independence is what tells me something. That should let me know when there is evidence to suggest the two variables are dependent, right? If so, then my earlier comment about Pearson's being the opposite of what I want was incorrect.

Is the phi coefficient, computed from the Yates-corrected statistic, better suited to my problem? What about some of the other correlation measures?
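In case it helps, the phi coefficient itself needs only the four cell counts; a minimal sketch without the Yates correction, again on placeholder counts:

import math

def phi_coefficient(A, B, C, D):
    # Phi ranges from -1 to 1, with 0 meaning no association.
    # It equals sqrt(chi-squared / N), up to sign, when no continuity
    # correction is applied.
    num = A * D - B * C
    den = math.sqrt((A + B) * (C + D) * (A + C) * (B + D))
    return num / den

print(phi_coefficient(5, 40_000, 120_000, 3_000_000_000))

One caveat: when D dwarfs the other three cells, phi tends to come out extremely close to zero even when the dependence is real, so it behaves more like an effect-size measure than a measure of evidence.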
 
