Need statistical test for contingency table with very small and big counts

I have a 2x2 contingency table, and I want to discover how likely it is that the two events are dependent.

The top-left cell is usually in the range of 1-10. The bottom-right cell can be over 3 billion. The other two cells are in the hundreds or millions (exclusively).

I have tried Pearson's chi-squared test, with and without Yates's correction for continuity, with varying degrees of success due to the large differences in magnitude of the entries. These tests also have the null hypothesis reversed from what I want: I want to use the p-value to definitively 'find' dependence (potentially missing some pairs for which there is no evidence), as opposed to finding independence and potentially misclassifying dependent events.

Fisher's exact test and Barnard's exact test look promising, but calculating the factorial of a number in the billions directly is just "impossible".
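One possible way around the factorial problem is to work with logarithms of factorials via the log-gamma function, so the huge numbers never have to be formed. The following is a minimal sketch, not a definitive implementation: the function names are made up for illustration, it computes only the one-sided (upper-tail) Fisher p-value P(top-left cell >= A) with the margins held fixed, and it assumes that double-precision log-gamma is accurate enough for the purpose (with counts in the billions the log-probabilities are probably only good to around five significant digits).

```python
import math

def log_choose(n, k):
    # log of the binomial coefficient C(n, k), computed without forming n!
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_hypergeom_pmf(k, row1, col1, n):
    # log P(top-left cell = k) under independence, with both margins fixed
    return (log_choose(col1, k)
            + log_choose(n - col1, row1 - k)
            - log_choose(n, row1))

def fisher_upper_tail(a, b, c, d):
    # One-sided Fisher exact p-value: P(top-left cell >= a).
    # Hypothetical helper name; a, b, c, d are the four cell counts.
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)  # largest value the top-left cell can take
    logs = [log_hypergeom_pmf(k, row1, col1, n) for k in range(a, k_max + 1)]
    m = max(logs)
    # Factor out the largest term so the exponentials do not all underflow.
    total = math.exp(m) * math.fsum(math.exp(x - m) for x in logs)
    return min(1.0, total)

# Illustrative counts only, in the ranges described above.
p = fisher_upper_tail(5, 300, 2000, 3_000_000_000)
print(p)
```

The number of terms in the sum is bounded by the smaller of the first row and first column totals, not by N, so a D in the billions does not by itself make the loop expensive.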

---

In case more overview is needed:

I have two binary events (let's say X and Y) for which I track occurrences throughout various "documents". Sometimes they occur together (A), sometimes one occurs without the other (B, C), and sometimes neither occurs (D). This creates my 2x2 contingency table, with A in the top-left and D in the bottom-right, and D being incredibly huge. There are N = A+B+C+D documents.

From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.

I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.

Hope this makes sense.

Stephen Tashi
You might already realize this, but the type of statistics you are talking about ("frequentist") does not quantify the probability of any idea about the data, such as that two things are independent or not. The numbers that it computes are essentially "the probability of the data given the assumption of some ideas", not "the probability of some ideas given the data". The approach of frequentist statistics is that if the probability of the data given the hypothesis is "small" (which is a subjective judgment), then the procedure is to "reject" the assumed ideas. If the probability of the data is, say, 0.05 given that a certain hypothesis is assumed, this does not imply that the probability that the hypothesis is false is 0.95.

Example: I saw a man dumping a body in a lake. Is he a murderer?
Null hypothesis: The man is a murderer.
Computation: The chances that a murderer will dispose of a body by dumping it in a lake are only .02.
Conclusion: We reject the null hypothesis. The man is not a murderer.
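
To put numbers on why that conclusion is absurd, here is a small sketch using Bayes' rule. Only the .02 comes from the example; the prior probability of being a murderer and the chance that a non-murderer dumps a body in a lake are invented purely for illustration, and the function name is hypothetical.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    # Bayes' rule: P(hypothesis | data) from P(data | hypothesis),
    # P(data | not hypothesis) and the prior P(hypothesis).
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1.0 - prior) * likelihood_if_false)

p_lake_given_murderer = 0.02       # from the example
p_murderer = 0.01                  # assumed prior, illustration only
p_lake_given_not_murderer = 1e-6   # assumed, illustration only

print(posterior(p_murderer, p_lake_given_murderer, p_lake_given_not_murderer))
```

With these assumed inputs the probability that the man is a murderer, given what was seen, comes out above 0.99 even though the probability of the data given the hypothesis is only .02. The answer depends entirely on the two assumed quantities, which frequentist tests never ask for.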

Thanks for the reply. I was hoping I could have the null hypothesis set up such that I would be rejecting the null-hypothesis of dependence. This should permit me to know when there is evidence to suggest that the two variables are dependent, right? If that is the case, then my earlier comment about Pearson's being the opposite of what I want was incorrect.

Is the Phi Coefficient of the Yates corrected statistic better suited for my problem? What about some of the other correlation measures?
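
For reference, a minimal sketch of the quantities being asked about, using the standard 2x2 formulas: phi = (AD - BC) / sqrt((A+B)(C+D)(A+C)(B+D)), and the Yates-corrected variant taken as sqrt(chi2_yates / N). The function names are made up for illustration, and this only shows how the numbers would be computed, not whether they suit the problem.

```python
import math

def phi_coefficient(a, b, c, d):
    # Signed phi coefficient of a 2x2 table with cells a, b / c, d.
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / math.sqrt(margins)

def phi_yates(a, b, c, d):
    # Unsigned phi derived from the Yates-corrected chi-squared statistic.
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / margins
    return math.sqrt(chi2 / n)

# Illustrative counts only.
print(phi_coefficient(5, 300, 2000, 3_000_000_000))
print(phi_yates(5, 300, 2000, 3_000_000_000))
```

Python integers do not overflow, so the products of cell counts in the billions are exact until the final conversion to floating point.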
