Need a statistical test for a contingency table with both very small and very large counts

AI Thread Summary
The discussion revolves around analyzing a 2x2 contingency table to determine the dependence between two binary events, X and Y, where one cell has very small counts and another has counts exceeding three billion. The user has attempted Pearson's chi-squared test and Yates' correction but faced challenges due to the disparity in cell values. They are exploring alternatives like Fisher's Exact and Barnard's Exact tests, but factorial calculations for large numbers are impractical. The goal is to establish a method that indicates dependence with a score ranging from 0 (no evidence) to 1 (overwhelming evidence), while also considering the limitations of frequentist statistics. The user seeks advice on whether the Phi Coefficient or other correlation measures might be more suitable for their analysis.
Sane
I have a 2x2 contingency table, and I want to discover how likely it is that the two events are dependent.

The top-left cell is usually in the range of 1-10. The bottom-right cell can be over 3 billion. The other two cells fall exclusively in the hundreds-to-millions range.

I have tried Pearson's chi-squared test, and Yates' correction for continuity, with varying degrees of success due to the large differences in magnitude of the entries. This approach also has the null hypothesis reversed from what I want: I want to use the p-value to definitively 'find' dependence (potentially missing some cases for which there is no evidence), as opposed to finding independence and potentially misclassifying dependent events.
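For concreteness, a rough sketch of this kind of test in Python with SciPy (the counts below are placeholders in the ranges described, not real data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Placeholder counts in the ranges described above (not real data)
table = np.array([[5,       120_000],
                  [800_000, 3_200_000_000]])

# Pearson's chi-squared test; correction=True applies Yates' continuity correction
chi2, p, dof, expected = chi2_contingency(table, correction=True)
print(chi2, p)
print(expected)  # expected counts under the null hypothesis of independence
```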

Fisher's Exact and Barnard's Exact tests look promising, but calculating the factorial of a number in the billions is just "impossible".
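That said, the factorials never need to be evaluated directly: the hypergeometric probabilities that Fisher's test sums can be computed with log-factorials via the log-gamma function. A minimal sketch of the single-table log-probability (not a full p-value), assuming SciPy and placeholder counts:

```python
from scipy.special import gammaln

def log_binom(n, k):
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_hypergeom_prob(a, b, c, d):
    """Log-probability of one 2x2 table under the hypergeometric model
    with fixed margins (the term Fisher's exact test sums over)."""
    n = a + b + c + d
    return log_binom(a + b, a) + log_binom(c + d, c) - log_binom(n, a + c)

# Works even when d is in the billions, since only logarithms are added
print(log_hypergeom_prob(5, 120_000, 800_000, 3_200_000_000))
```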

---

In case more overview is needed:

I have two binary events (let's say X and Y) for which I track occurrences throughout various "documents". Sometimes they occur together (A), one occurs without the other (B or C), or neither occurs (D). This creates my 2x2 contingency table, with A in the top-left and D in the bottom-right, and D being incredibly huge. There are N = A+B+C+D documents.
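To make that concrete, a small sketch of the table and the expected cell counts under independence (placeholder values for A, B, C and D):

```python
import numpy as np

# Placeholder counts: A = both occur, B/C = only one occurs, D = neither occurs
A, B, C, D = 5, 120_000, 800_000, 3_200_000_000
N = A + B + C + D

# float64 to keep the marginal products from overflowing 64-bit integers
table = np.array([[A, B],
                  [C, D]], dtype=np.float64)

row_totals = table.sum(axis=1, keepdims=True)   # (A+B), (C+D)
col_totals = table.sum(axis=0, keepdims=True)   # (A+C), (B+D)
expected = row_totals * col_totals / N          # expected counts under independence

# If X and Y were independent, A should be close to expected[0, 0]
print(expected[0, 0], "expected co-occurrences vs", A, "observed")
```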

From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.

I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.

Hope this makes sense.
 
Sane said:
From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.

I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.

Hope this makes sense.

You might already realize this, but the type of statistics you are talking about ("frequentist") does not quantify the probability of any idea about the data, such as the idea that two things are independent or not. The numbers it computes are essentially "the probability of the data given the assumption of some ideas", not "the probability of some ideas given the data". The approach of frequentist statistics is that if the probability of the data given the hypothesis is "small" (which is a subjective judgment), then the procedure is to "reject" the assumed ideas. If the probability of the data is, say, 0.05 given that a certain hypothesis is assumed, this does not imply that the probability that the hypothesis is false is 0.95.

Example: I saw a man dumping a body in a lake. Is he a murderer?
Null hypothesis: The man is a murderer.
Computation: The chances that a murderer will dispose of a body by dumping it in a lake are only 0.02.
Conclusion: We reject the null hypothesis. The man is not a murderer.
 
Thanks for the reply. I was hoping I could set up the test so that I would be rejecting a null hypothesis of independence. That should permit me to know when there is evidence to suggest that the two variables are dependent, right? If that is the case, then my earlier comment about Pearson's test being the opposite of what I want was incorrect.

Is the Phi Coefficient of the Yates-corrected statistic better suited to my problem? What about some of the other correlation measures?
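For reference, for a 2x2 table the phi coefficient can be computed directly from the cell counts; a sketch with the same placeholder counts as above (in magnitude it equals the square root of chi-squared over N for the uncorrected statistic):

```python
import math

# Placeholder counts in the ranges described earlier (not real data)
A, B, C, D = 5, 120_000, 800_000, 3_200_000_000

# Phi coefficient for a 2x2 table:
#   phi = (A*D - B*C) / sqrt((A+B) * (C+D) * (A+C) * (B+D))
num = A * D - B * C
den = math.sqrt((A + B) * (C + D) * (A + C) * (B + D))
phi = num / den
print(phi)
```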
 