Test for independence in contingency tables with zeros

In summary, the conversation discusses the problem of performing a test for independence on a 16x16 contingency table containing a large number of zero cells. Commonly used tests such as the χ²-test, the χ²-test with Yates's correction, and the G-test are not applicable here because they lead to divisions by zero. Suggestions include grouping the data into coarser bins or using the Wilcoxon signed-rank and rank-sum tests. Another approach is to simulate data under the null hypothesis and build a statistic to decide whether the variables are correlated. The conversation also raises the possibility of basing a test on the SVD or on non-negative matrix factorization, but no specific tests of that kind are known to the participants.
  • #1
mnb96
Hello,

I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).
Unfortunately many cells in the table contain zeros. The zeros are due to the fact that the sample size was not sufficiently large to properly populate the table (I had "only" 200 samples, while the table has 256 cells).

In this case I can use neither the χ²-test, nor the χ²-test with Yates's correction, nor the G-test, because I obviously get divisions by zero.

What kind of test can I perform in such situations?
 
  • #2
mnb96 said:
I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).

You'll get the best advice if you describe the real-world problem. I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

If you are determined to use bins, you could group the data into coarser bins. For example, put the cases (x,y) in (1,1), (1,2), (2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.
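To make the coarsening concrete, here is a minimal numpy sketch; the 16x16 table is hypothetical, and the non-overlapping 4x4 block grouping is just one possible choice:

```python
import numpy as np

# Hypothetical sparse 16x16 table of counts (random data for illustration).
rng = np.random.default_rng(0)
table = rng.poisson(0.8, size=(16, 16))

# Collapse each non-overlapping 4x4 block of cells into a single coarser bin,
# giving a 4x4 table whose cells are better populated.
coarse = table.reshape(4, 4, 4, 4).sum(axis=(1, 3))

print(coarse)                        # 4x4 table of counts
print(coarse.sum() == table.sum())   # the total count is preserved
```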
 
  • #3
Stephen Tashi said:
I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

Not really. Perhaps I am just "casting about for some statistical method" that I could use to perform a hypothesis test! All the commonly used tests are simply inapplicable here.


If you are determined to use bins, you could group the data into coarser bins. For example, put the cases (x,y) in (1,1), (1,2), (2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.

I basically have a stream of data of the kind (X1,Y1),...,(Xn,Yn), where each "unit" is simply a pair of integers in the range [0, 15]. If you are familiar with bits and bytes, you can imagine this stream as a sequence of bytes, where each pair (X,Y) is given by the lower 4 bits (the variable X) and the upper 4 bits (the variable Y).

Basically I would like to find statistical evidence that the information contained in X (the lower bits) is correlated with the information in Y (the upper bits). I would prefer to avoid rebinning.

Currently I am running a simulation with artificial data, and I already know that such data do manifest a correlation between X and Y. It is even quite evident from the SVD of the 16x16 contingency table, which has two very large singular values, while all the others are close to 0. But again, I don't know how to turn this into a decision test.
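For reference, a minimal numpy sketch of this setup, with a hypothetical random byte stream standing in for the real data:

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.integers(0, 256, size=200)  # hypothetical stream of 200 bytes

x = stream & 0x0F          # lower 4 bits: the variable X
y = (stream >> 4) & 0x0F   # upper 4 bits: the variable Y

# 16x16 contingency table of (Y, X) pair counts.
table = np.zeros((16, 16), dtype=int)
np.add.at(table, (y, x), 1)

# Singular values of the table; in the dependent case described above,
# the first two dominate and the rest are close to zero.
print(np.linalg.svd(table, compute_uv=False))
```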
 
  • #4
Since you have naturally paired observations, look at the Wilcoxon signed-rank test and the rank-sum test.
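If the data are available as paired integer arrays, both tests are in scipy; a minimal sketch with placeholder data (note that these tests look for a systematic shift between X and Y rather than testing independence directly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=200)  # placeholder paired samples
y = rng.integers(0, 16, size=200)

# Signed-rank test on the paired differences x - y.
print(stats.wilcoxon(x, y))

# Rank-sum test, treating the two samples as independent groups.
print(stats.ranksums(x, y))
```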


Since you know how to do simulation, another approach is to make up your own statistic and obtain its distribution (under the assumption the null hypothesis is true) by using simulation.

For example, there is a formula for computing the chi-square statistic from given sample values, and then there are the associated complications of finding which table to use for the distribution of that statistic, such as checking that no cells are empty, finding the "degrees of freedom", etc. You avoid all of those complications if you obtain the distribution of the statistic by simulation.

In Statistics, a "statistic" is defined as some function of the values in a sample. So a statistic can be any function. If you can simulate how the values in a sample are generated, you can obtain a good estimate for the distribution of a statistic. If you have a model for generating the data when the null hypothesis is false, you can determine how effective a statistic is at discriminating between the null hypothesis and the alternative.
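As one concrete (and entirely optional) instance of this idea, here is a permutation-test sketch in Python: the statistic is a chi-square-like sum that simply skips cells with zero expected count, and its null distribution is obtained by shuffling y relative to x, which enforces independence while preserving both marginals:

```python
import numpy as np

def statistic(x, y):
    """Any scalar function of the paired data; here a chi-square-like
    sum over the 16x16 table, skipping cells with zero expected count."""
    t = np.zeros((16, 16))
    np.add.at(t, (y, x), 1)
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    m = expected > 0
    return ((t[m] - expected[m]) ** 2 / expected[m]).sum()

def permutation_pvalue(x, y, n_sim=10_000, seed=0):
    """Estimate P(statistic >= observed) under independence by permutation."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    null = np.array([statistic(x, rng.permutation(y)) for _ in range(n_sim)])
    return (1 + (null >= observed).sum()) / (1 + n_sim)
```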
 
  • #5
Hi Stephen,
thanks for your suggestion: I didn't know anything about the Wilcoxon signed-rank test.

About the simulation part, I didn't completely understand what you meant.
Are you suggesting that I should simulate a scenario where X and Y are independent, and generate very many samples in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) under independence?

By the way, are you aware of any test of independence that is based on the SVD or on non-negative matrix factorization? As I said, from both the SVD and the NNMF I can see visually that the 16x16 contingency table can be reconstructed very faithfully with only a rank-2 approximation (thus suggesting a dependency). The problem is how to convert this "visual evidence" into a yes/no decision.
 
  • #6
mnb96 said:
Are you suggesting that I should simulate a scenario where X and Y are independent, and generate very many samples in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) under independence?

I'm saying you can implement a simulation where X and Y are independent and find the distribution of any scalar-valued function of the data that you want. Let F((x1,y1),(x2,y2),...,(xn,yn)) be any real-valued function of the data that is of interest. You can generate simulated samples and from them create a histogram of the values of F that approximates its distribution. From that histogram, for a given number r, you can determine the probability that F is less than r. Thus you can do a "one-tailed" hypothesis test based on the observed value of F on the actual data. You could also do a "two-tailed" test.
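Here is a hedged sketch of that recipe, with one possible choice of F that also connects to the SVD observation in post #3: under independence the expected table is rank 1 (the outer product of the marginals), so the second-largest singular value of the table measures departure from independence. The null is simulated by drawing X and Y independently from their empirical marginals, which is one assumption among several reasonable sampling schemes:

```python
import numpy as np

def F(x, y):
    """Example statistic: the second-largest singular value of the
    contingency table. Under independence the table is close to rank 1,
    so large values of F indicate dependence."""
    t = np.zeros((16, 16))
    np.add.at(t, (y, x), 1)
    return np.linalg.svd(t, compute_uv=False)[1]

def one_tailed_pvalue(x, y, n_sim=5_000, seed=0):
    """Simulate samples with X, Y independent and estimate P(F >= observed)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    px = np.bincount(x, minlength=16) / n  # empirical marginal of X
    py = np.bincount(y, minlength=16) / n  # empirical marginal of Y
    observed = F(x, y)
    null = np.array([
        F(rng.choice(16, n, p=px), rng.choice(16, n, p=py))
        for _ in range(n_sim)
    ])
    return (1 + (null >= observed).sum()) / (1 + n_sim)
```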


By the way, are you aware of any test of independence that is based on the SVD or on non-negative matrix factorization?

I'm not familiar with such tests for "categorical data" (e.g. if your bytes represent things like different medicines or names of models of cars). Things involving matrix multiplications relate to linear transformations of coordinates. For contexts where applying a transformation of coordinates to the data makes sense (e.g. if your bytes represent spatial coordinates), there are methods of data analysis such as "principal components analysis" and "independent components analysis". I assume that there are "well known" statistical tests that accompany these methods - but they aren't "well known" to me.
 
  • #7
Thanks for the suggestions. I think I got the idea now.

About the SVD-like techniques (e.g. NNMF), I have seen them applied to categorical data too. One example I came across is this; it is supposedly taken from an example in a book. It seems that the more general form of such statistical techniques goes under the name of "correspondence analysis".
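For the record, the core computation behind correspondence analysis is short. A sketch, assuming a numpy count table with no all-zero row or column:

```python
import numpy as np

def correspondence_singular_values(table):
    """Correspondence analysis of a two-way table of counts: the SVD of
    the matrix of standardized residuals. The squared singular values sum
    to chi-square / n (the "total inertia"), which is one way to turn the
    SVD picture into a single test statistic."""
    n = table.sum()
    P = table / n
    r = P.sum(axis=1)   # row masses
    c = P.sum(axis=0)   # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False)
```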
 

1. What is a contingency table?

A contingency table is a type of table used in statistics to display the frequency distribution of two or more categorical variables. It shows the number of individuals in each combination of categories for the variables, allowing for easy comparison and analysis.

2. Why is a test for independence necessary in contingency tables with zeros?

A test for independence is necessary in contingency tables with zeros because the presence of zeros can affect the results of the analysis. Zeros occur when no observations fall into a particular combination of categories, and they can distort the results if not properly accounted for.

3. How is a test for independence performed in contingency tables with zeros?

A test for independence in contingency tables with zeros is typically performed using Fisher's exact test or a Monte Carlo (permutation) version of the chi-square test, since the usual asymptotic chi-square approximation becomes unreliable when many cells are empty or have small expected counts.
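As an illustrative sketch in Python (scipy's fisher_exact handles only 2x2 tables, so for larger tables a chi-square or Monte Carlo approach is typical):

```python
import numpy as np
from scipy import stats

# A hypothetical 3x3 table with several empty cells but no empty row/column.
table = np.array([[10, 0, 3],
                  [0, 8, 0],
                  [2, 0, 7]])

# chi2_contingency tolerates zero cells as long as no expected count is zero
# (i.e., no all-zero row or column); with small expected counts the asymptotic
# p-value is unreliable and an exact or simulated test is preferable.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)
```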

4. What does it mean if the test for independence is significant?

If the test for independence in a contingency table with zeros is significant, it means that there is a statistically significant association between the variables being studied: the data provide evidence that the variables are not independent of each other.

5. Can a test for independence be performed on contingency tables with more than two variables?

Yes, a test for independence can be performed on contingency tables with more than two variables. This is known as a chi-square test for association or, more generally, a log-linear model, and it is used to analyze the relationships among multiple categorical variables simultaneously.
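As a sketch of the log-linear approach, with hypothetical data and factor names: fitting a Poisson GLM with main effects only corresponds to the model of mutual independence, and its deviance is the G-statistic:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical 2x3x2 table, flattened to one count per category combination.
rng = np.random.default_rng(0)
cells = pd.MultiIndex.from_product(
    [range(2), range(3), range(2)], names=["a", "b", "c"]
)
df = pd.DataFrame(index=cells).reset_index()
df["count"] = rng.poisson(20, size=len(df))

# Main-effects-only log-linear model: its deviance is the G-statistic for
# mutual independence, referred to chi-square with df_resid degrees of freedom.
fit = smf.glm("count ~ C(a) + C(b) + C(c)", data=df,
              family=sm.families.Poisson()).fit()
print(fit.deviance, fit.df_resid)
```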
