Test for independence in contingency tables with zeros

  • Thread starter: mnb96
  • Tags: Independence Test

Summary
The discussion revolves around performing a test for independence in a 16x16 contingency table with many zero entries due to a small sample size of 200. Standard tests like the χ2-test and G-test are inapplicable because of the zeros. Suggestions include using the Wilcoxon signed-rank test or simulating data to estimate the distribution of a custom statistic under the null hypothesis. The conversation also touches on using Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) as potential methods for analyzing the data, with references to correspondence analysis for categorical data. The need for a robust statistical approach to determine dependency between the variables X and Y remains a key focus.
mnb96
Hello,

I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).
Unfortunately, many cells in the table contain zeros. The zeros are due to the fact that the sample size was not large enough to properly populate the table (I had "only" 200 samples, while the table has 256 cells).

In this case I can use neither the χ2-test, nor the χ2-test with Yates's correction, nor the G-test, because I obviously get divisions by zero.

What kind of test can I perform in such situations?
 
mnb96 said:
I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).

You'll get the best advice if you describe the real world problem. I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

If you are determined to use bins, you could group the data into coarser bins. For example, put the cases (x,y) = (1,1), (1,2), (2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.
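A minimal sketch of this coarser-binning idea, assuming the counts sit in a 16x16 NumPy array (the array name and the random data below are made up); merging 4x4 blocks of adjacent codes gives a 4x4 table that the ordinary chi-square test can handle, which only makes sense if adjacent codes are genuinely comparable:

Code:
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 16x16 contingency table of counts (rows = Y, columns = X),
# standing in for the real data.
rng = np.random.default_rng(0)
table = rng.multinomial(200, np.ones(256) / 256).reshape(16, 16)

# Collapse each 4x4 block of cells into one coarser cell, giving a 4x4 table.
coarse = table.reshape(4, 4, 4, 4).sum(axis=(1, 3))

# Ordinary chi-square test of independence on the coarser table.
chi2, p, dof, expected = chi2_contingency(coarse)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")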
 
Stephen Tashi said:
I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

Not really. Perhaps I am just "casting about for some statistical method" that I could use to perform a hypothesis test! All the commonly used tests are simply inapplicable here.


If you are determined to use bins, you could group the data into coarser bins. For example, put the cases (x,y) = (1,1), (1,2), (2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.

I basically have a stream of data of the kind (X1,Y1),...,(Xn,Yn), where each "unit" is simply a pair of integers in the range [0,...,15]. If you are familiar with bits and bytes, then you can imagine this stream of data as a sequence of bytes, where each pair (X,Y) is given by the lower 4 bits (the variable X) and the higher 4 bits (the variable Y).

Basically I would like to find statistical evidence that the information contained in X (the lower bits) is correlated with the information in Y (the higher bits). I would prefer to avoid rebinning.

Currently I am running a simulation with artificial data, and I already know that these data do manifest a correlation between X and Y. It is even quite evident from the SVD of the 16x16 contingency table, which has two very large singular values while all the others are close to 0. But again, I don't know how to turn this into a decision test.
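For what it's worth, those singular values can be inspected directly; here is a sketch with a made-up table standing in for the real counts. Under exact independence the table of expected frequencies is the outer product of the marginals, hence rank 1, so a clearly non-negligible second singular value is what hints at dependence (though by itself this is not yet a test):

Code:
import numpy as np

# Made-up 16x16 table of counts (rows = Y, columns = X).
rng = np.random.default_rng(1)
table = rng.multinomial(200, np.ones(256) / 256).reshape(16, 16).astype(float)

# Singular values of the observed table.  The frequency table expected under
# independence is rank 1 (outer product of the marginals), so genuine
# dependence tends to show up as additional large singular values.
singular_values = np.linalg.svd(table, compute_uv=False)
print(np.round(singular_values, 2))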
 
Since you have naturally paired observations, look at the Wilcoxon signed-rank test and rank sum test.
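A quick sketch of applying the signed-rank test to the paired stream (the arrays x and y below are invented stand-ins for the real pairs); note that it tests whether the differences Y - X are centred at zero, which is a narrower question than general dependence:

Code:
import numpy as np
from scipy.stats import wilcoxon

# Invented paired stream: 200 observations, each a pair (X_i, Y_i) in 0..15.
rng = np.random.default_rng(2)
x = rng.integers(0, 16, size=200)
y = rng.integers(0, 16, size=200)

# Signed-rank test on the paired differences Y_i - X_i.  With small integers
# many differences are exactly zero; scipy's default zero_method discards them.
stat, p = wilcoxon(x, y)
print(f"W = {stat:.1f}, p = {p:.3f}")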


Since you know how to do simulation, another approach is to make up your own statistic and obtain its distribution (under the assumption the null hypothesis is true) by using simulation.

For example, there is a formula for computing the chi-square statistic from the sample values. Then there are the associated complications of finding what table to use for the distribution of that statistic - such as checking that no cells are empty, finding the "degrees of freedom", etc. You don't have to deal with the complications of finding what table to use if you obtain the distribution of the statistic by simulation.

In Statistics, a "statistic" is defined as some function of the values in a sample. So a statistic can be any function. If you can simulate how the values in a sample are generated, you can obtain a good estimate for the distribution of a statistic. If you have a model for generating the data when the null hypothesis is false, you can determine how effective a statistic is at discriminating between the null hypothesis and the alternative.
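As a hedged sketch of that simulation idea: draw X and Y independently from the observed marginals (one reasonable way to model the null hypothesis), compute your chosen statistic on each simulated table, and collect the values. The function names and the particular statistic below are just placeholders for "any function of the sample":

Code:
import numpy as np

def statistic(table):
    # Any scalar function of the table will do; this placeholder is the squared
    # distance between the table and the table expected under independence.
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2).sum())

def simulate_null(row_p, col_p, n, n_sims=5000, seed=0):
    # Distribution of the statistic when X and Y are drawn independently
    # from the given marginal probabilities (i.e. under the null hypothesis).
    rng = np.random.default_rng(seed)
    values = np.empty(n_sims)
    for k in range(n_sims):
        x = rng.choice(16, size=n, p=col_p)
        y = rng.choice(16, size=n, p=row_p)
        sim_table = np.zeros((16, 16))
        np.add.at(sim_table, (y, x), 1)   # build the simulated 16x16 table
        values[k] = statistic(sim_table)
    return values

The observed value of the statistic on the real table is then compared against this simulated distribution.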
 
Hi Stephen,
thanks for your suggestion: I didn't know anything about the Wilcoxon signed-rank test.

About the simulation part, I didn't completely understand what you meant. Are you suggesting that I should simulate a scenario where X and Y are independent and generate very many samples in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) in the case of independence?

By the way, are you aware of any test of independence that is based on the SVD or on Non-Negative Matrix Factorization? As I said, from both the SVD and the NNMF I can visually see that the 16x16 contingency table can be reconstructed very faithfully with only a rank-2 approximation (thus suggesting a dependency). The problem is how to convert this "visual evidence" into a yes/no decision.
 
mnb96 said:
Are you suggesting that I should simulate a scenario where X and Y are independent and generate very many samples in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) in the case of independence?

I'm saying you can implement a simulation where x and y are independent and find the distribution of any scalar-valued function of the data that you want. Let F((x1,y1),(x2,y2),...,(xn,yn)) be any real-valued function of the data that is of interest. You can generate simulated samples and from them create a histogram of the values of F that approximates its distribution. From that histogram, for a given number r, you can determine the probability that F is less than r. Thus you can do a "one-tailed" hypothesis test based on the observed value of F on the actual data. You could also do a "two-tailed" test.
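Concretely, with null_values holding the simulated values of F and f_obs its value on the actual data (both invented names for this sketch), the one-tailed p-value can be read off like this:

Code:
import numpy as np

def one_tailed_p(null_values, f_obs):
    # Estimated P(F >= f_obs | H0); the +1 terms are the usual finite-sample
    # correction for Monte Carlo tests.
    null_values = np.asarray(null_values)
    return (np.sum(null_values >= f_obs) + 1) / (len(null_values) + 1)

A two-tailed version would combine the corresponding probabilities from both tails.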


By the way, are you aware of any test of independence that is based on the SVD or on Non-Negative Matrix Factorization?

I'm not familiar with such tests for "categorical data" (e.g. if your bytes represent things like different medicines or names of car models). Things involving matrix multiplications relate to linear transformations of coordinates. For contexts where applying a transformation of coordinates to the data makes sense (e.g. if your bytes represent spatial coordinates), there are methods of data analysis such as "principal components analysis" and "independent components analysis". I assume that there are "well known" statistical tests that accompany these methods - but they aren't "well known" to me.
 
Thanks for the suggestions. I think I got the idea now.

About the SVD-like techniques (e.g. NNMF), I have seen them applied to categorical data too. One example I came across is this, supposedly taken from an example in a book. It seems that the more general form of such techniques goes under the name of "correspondence analysis".
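In correspondence analysis the link to a test statistic is fairly direct: the squared singular values of the matrix of standardized residuals sum to χ2/n, so the decomposition and the usual statistic are two views of the same quantity. A sketch, assuming table holds the 16x16 counts and no row or column is entirely empty:

Code:
import numpy as np

def correspondence_svd(table):
    # Standardized residuals used in correspondence analysis; their squared
    # singular values sum to chi-square / n (needs nonzero row and column totals).
    table = np.asarray(table, dtype=float)
    n = table.sum()
    P = table / n                          # correspondence matrix
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    return sv, n * (sv ** 2).sum()         # singular values, chi-square value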
 
