Test for independence in contingency tables with zeros

  • Context: Graduate
  • Thread starter: mnb96
  • Tags: Independence Test

Discussion Overview

The discussion revolves around performing a test for independence in a 16x16 contingency table that contains many zero entries due to a limited sample size. Participants explore various statistical methods and approaches to address the challenges posed by the zeros in the data.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Exploratory

Main Points Raised

  • One participant expresses the need to test for independence between variables X and Y but faces issues with zeros in the contingency table, making traditional tests like the χ2-test inapplicable.
  • Another participant suggests that the original poster should clarify the real-world problem and cautions against aiming to reject the null hypothesis without proper justification.
  • Some participants propose grouping data into coarser bins to mitigate the issue of zeros, although the original poster prefers to avoid this approach.
  • A suggestion is made to use the Wilcoxon signed-rank test or rank sum test, considering the paired nature of the observations.
  • There is a discussion about simulating data to create a distribution for a custom statistic, which could help in hypothesis testing without the complications of traditional methods.
  • The original poster inquires about tests based on Singular Value Decomposition (SVD) or Non-Negative Matrix Factorization (NNMF) and expresses interest in converting visual evidence from these methods into a formal decision test.
  • Another participant mentions that while they are not familiar with tests for categorical data using SVD or NNMF, there are related methods like principal components analysis that could be relevant.

Areas of Agreement / Disagreement

Participants do not reach a consensus on a specific method to test for independence due to the presence of zeros in the contingency table. Multiple competing views and approaches are presented, reflecting the complexity of the problem.

Contextual Notes

Participants highlight limitations such as the dependence on sample size, the implications of zero entries in the data, and the challenges of applying traditional statistical tests in this context.

mnb96
Hello,

I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).
Unfortunately many cells in the table contain zeros. The zeros are due to the fact that the sample size was not large enough to properly populate the table (I had "only" 200 samples, while the table has 256 cells).

In this case I can use neither the χ2-test, nor the χ2-test with Yates's correction, nor the G-test, because I obviously get divisions by zero.



What kind of test can I perform in such situations?
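To make the problem concrete, here is a small sketch with hypothetical data (not the poster's actual stream) showing why 200 samples spread over 256 cells break the classical tests: many cells are empty, and virtually every expected count falls below the usual rule of thumb for the χ2 approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 independent (X, Y) pairs, each uniform on 0..15.
n = 200
x = rng.integers(0, 16, size=n)
y = rng.integers(0, 16, size=n)

# Build the 16x16 contingency table.
table = np.zeros((16, 16), dtype=int)
np.add.at(table, (y, x), 1)

# With 200 samples and 256 cells, at least 56 cells must be empty,
# and in practice many more are.
zero_cells = int(np.sum(table == 0))

# Expected counts under independence: row_total * col_total / n.  The
# usual rule of thumb for the chi-square approximation wants these to be
# >= 5; here essentially every cell fails it (each is roughly 200/256).
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
small_expected = int(np.sum(expected < 5))

print(zero_cells, small_expected)
```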
 
mnb96 said:
I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).

You'll get the best advice if you describe the real world problem. I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

If you are determined to use bins, you could group the data into coarser bins. For example put the cases (x,y) in (1,1),(1,2),(2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.
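The rebinning suggestion can be sketched as follows; the sparse 16x16 table here is hypothetical, and merging every 4 consecutive values into one coarse bin is just one arbitrary grouping.

```python
import numpy as np

# Hypothetical sparse contingency table, not the poster's actual data.
rng = np.random.default_rng(1)
table = rng.poisson(200 / 256, size=(16, 16))

# Collapse 16x16 into 4x4 by summing over 4x4 blocks: reshape to
# (4, 4, 4, 4) and sum away the fine-grained axes.  Each coarse cell
# then pools 16 original cells, so zeros become much rarer.
coarse = table.reshape(4, 4, 4, 4).sum(axis=(1, 3))

print(coarse.shape, coarse.sum() == table.sum())
```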
 
Stephen Tashi said:
I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

Not really. Perhaps I am just "casting about for some statistical method" that I could use to perform a hypothesis test! All the commonly used tests are simply inapplicable here.


If you are determined to use bins, you could group the data into coarser bins. For example put the cases (x,y) in (1,1),(1,2),(2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.

I basically have a stream of data of the kind (X1,Y1),...,(Xn,Yn), where each "unit" is simply a pair of integers in the range [0,...,15]. If you are familiar with bits and bytes, then you can imagine this stream of data as a sequence of bytes, where each pair (X,Y) is given by the lower 4 bits (the variable X) and the upper 4 bits (the variable Y).

Basically I would like to find statistical evidence that the information contained in X (the lower bits) is correlated with the information in Y (the higher bits). I would prefer to avoid rebinning.

Currently I am running a simulation with artificial data, and I already know that these data do manifest a correlation between X and Y. It is even quite evident from the SVD of the 16x16 contingency table, which has two very large singular values while all the others are close to 0. But again, I don't know how to make a decision test out of it.
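The kind of evidence described here can be illustrated with simulated data that has a built-in dependence (this is a sketch, not the poster's actual simulation): under independence the population table is exactly rank 1, so extra non-negligible singular values are informal evidence of dependence.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dependent data: Y tracks X up to +/-1 (mod 16), so the
# joint distribution is genuinely far from the rank-1 product P(x)P(y).
n = 200
x = rng.integers(0, 16, size=n)
y = (x + rng.integers(-1, 2, size=n)) % 16

table = np.zeros((16, 16))
np.add.at(table, (y, x), 1)

# Singular values of the normalised table.  Under independence the
# population table has one dominant singular value and the rest are
# only sampling noise; several clearly non-negligible singular values
# hint at dependence - but this is not yet a formal yes/no test.
s = np.linalg.svd(table / n, compute_uv=False)
print(np.round(s[:4], 3))
```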
 
Since you have naturally paired observations, look at the Wilcoxon signed-rank test and rank sum test.


Since you know how to do simulation, another approach is to make up your own statistic and obtain its distribution (under the assumption the null hypothesis is true) by using simulation.

For example, there is a formula for computing the chi-square statistic from given sample values. Then there are the associated complications of finding what table to use for the distribution of that statistic - such as checking that no cells are empty, finding the "degrees of freedom", etc. You don't have to deal with the complications of finding what table to use if you obtain the distribution of the statistic by simulation.

In Statistics, a "statistic" is defined as some function of the values in a sample. So a statistic can be any function. If you can simulate how the values in a sample are generated, you can obtain a good estimate for the distribution of a statistic. If you have a model for generating the data when the null hypothesis is false, you can determine how effective a statistic is at discriminating between the null hypothesis and the alternative.
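One concrete way to realise this suggestion, sketched with hypothetical data: take the second singular value of the normalised table as the custom statistic, and obtain its null distribution by shuffling Y, which enforces independence while preserving both observed marginals exactly. The statistic and the data-generating choices here are illustrative assumptions, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(3)

def second_singular_value(x, y, k=16):
    # Custom statistic F: the 2nd singular value of the normalised
    # k x k contingency table; it is near zero under independence.
    table = np.zeros((k, k))
    np.add.at(table, (y, x), 1)
    return np.linalg.svd(table / len(x), compute_uv=False)[1]

# Hypothetical "observed" data with a built-in dependence (Y tracks X).
n = 200
x_obs = rng.integers(0, 16, size=n)
y_obs = (x_obs + rng.integers(-1, 2, size=n)) % 16
f_obs = second_singular_value(x_obs, y_obs)

# Null distribution by simulation: shuffling Y breaks any X-Y
# association while keeping both marginal distributions as observed.
f_null = np.array([second_singular_value(x_obs, rng.permutation(y_obs))
                   for _ in range(500)])

# One-tailed p-value: how often the null produces an F this large.
p = np.mean(f_null >= f_obs)
print(round(f_obs, 3), p)
```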
 
Hi Stephen,
thanks for your suggestion: I didn't know anything about the Wilcoxon signed-rank test.

About the simulation part, I didn't understand completely what you meant.
Are you suggesting that I should simulate a scenario where X and Y are independent and generate very many samples, in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) in the case of independence?

By the way, are you aware of any test of independence that is based on the SVD or on the Non-Negative Matrix Factorization? As I said, from both the SVD and the NNMF I am able to see visually that the 16x16 contingency table can be very faithfully reconstructed with only a rank-2 approximation (thus suggesting a dependency). The problem is how to convert this "visual evidence" into a yes/no decision.
 
mnb96 said:
Are you suggesting that I should simulate a scenario where X and Y are independent and generate very many samples, in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) in the case of independence?

I'm saying you can implement a simulation where x and y are independent and find the distribution of any scalar-valued function of the data that you want. Let F((x1,y1),(x2,y2),...,(xn,yn)) be any real-valued function of the data that is of interest. You can generate simulated samples and from them create a histogram of the values of F that approximates its distribution. From that histogram, for a given number r, you can determine the probability that F is less than r. Thus you can do a "one-tailed" hypothesis test based on the observed value of F on the actual data. You could also do a "two-tailed" test.
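A minimal sketch of these mechanics, using a deliberately simple placeholder statistic (the sample mean under a standard-normal null; both choices are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)

# F can be any scalar-valued function of the data; the mean is just a
# simple stand-in here.
def F(sample):
    return sample.mean()

# Estimate the null distribution of F by simulating many datasets under
# H0 (here: 50 iid standard-normal values per dataset).
null_values = np.array([F(rng.standard_normal(50)) for _ in range(10000)])

# Hypothetical observed value of F on the "actual" data.
f_observed = 0.35

# One-tailed p-value: fraction of simulated F at least this extreme.
p_one_tailed = np.mean(null_values >= f_observed)

# Two-tailed version: compare |F| against the simulated |F| values.
p_two_tailed = np.mean(np.abs(null_values) >= abs(f_observed))

print(round(p_one_tailed, 3), round(p_two_tailed, 3))
```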


By the way, are you aware of any test of independence that is based on the SVD or on the Non-Negative Matrix factorization?

I'm not familiar with such tests for "categorical data" (e.g. if your bytes represent things like different medicines or names of car models). Things involving matrix multiplications relate to linear transformations of coordinates. For contexts where applying a transformation of coordinates to the data makes sense (e.g. if your bytes represent spatial coordinates), there are methods of data analysis such as "principal components analysis" and "independent components analysis". I assume that there are "well known" statistical tests that accompany these methods - but they aren't "well known" to me.
 
Thanks for the suggestions. I think I got the idea now.

About the SVD-like techniques (e.g. NNMF), I have seen them applied to categorical data too. One example I came across is this, supposedly extracted from an example found in a book. It seems that the more general form of such statistical techniques goes under the name of "correspondence analysis".
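A small correspondence-analysis sketch on hypothetical data may help connect the threads of this discussion: CA takes the SVD of the standardised residuals of the table, and the squared singular values (the "inertia") sum to χ2/n, which links the SVD picture back to the classical independence statistic.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical dense table; +1 keeps every marginal strictly positive
# (correspondence analysis divides by the marginals).
table = rng.poisson(5, size=(8, 8)) + 1
n = table.sum()
P = table / n
r = P.sum(axis=1)          # row masses
c = P.sum(axis=0)          # column masses

# Standardised residuals S_ij = (P_ij - r_i c_j) / sqrt(r_i c_j).
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# CA is the SVD of S; the total inertia is the sum of squared
# singular values.
s = np.linalg.svd(S, compute_uv=False)
inertia = np.sum(s ** 2)

# Classical chi-square statistic for comparison: sum (O - E)^2 / E.
E = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
chi2 = np.sum((table - E) ** 2 / E)

print(bool(np.isclose(inertia, chi2 / n)))
```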
 
