Test for independence in contingency tables with zeros

  1. Apr 10, 2014 #1

    I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).
    Unfortunately many cells in the table contain zeros. The zeros are due to the fact that the sample size was not sufficiently big to properly populate the table (I had "only" 200 samples, while the cells of the table are 256).

    In this case I can use neither the χ2 test, nor the χ2 test with Yates's correction, nor the G-test, because I obviously get divisions by zero.

    What kind of test can I perform in such situations?
  2. jcsd
  3. Apr 10, 2014 #2

    Stephen Tashi

    Science Advisor

    You'll get the best advice if you describe the real world problem. I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

    If you are determined to use bins, you could group the data into coarser bins. For example, put the cases where (x,y) is (1,1), (1,2), (2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.
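
    A minimal sketch of the coarser-bins idea, assuming the data sit in a 16x16 count table and that adjacent 2x2 blocks are merged. The function name `coarsen` and the random example data are hypothetical, used only for illustration.

```python
import numpy as np

def coarsen(table, factor=2):
    """Sum adjacent factor x factor blocks of a square count table."""
    table = np.asarray(table)
    n = table.shape[0]
    if table.shape != (n, n) or n % factor != 0:
        raise ValueError("expected a square table whose size is divisible by factor")
    m = n // factor
    # Axes 1 and 3 index the position inside each block; summing them merges the block.
    return table.reshape(m, factor, m, factor).sum(axis=(1, 3))

rng = np.random.default_rng(0)
# Hypothetical sparse 16x16 table built from 200 random (x, y) pairs.
x = rng.integers(0, 16, size=200)
y = rng.integers(0, 16, size=200)
table = np.zeros((16, 16), dtype=int)
np.add.at(table, (y, x), 1)

coarse = coarsen(table, factor=2)  # 8x8 table with the same total count
```

    Merging blocks raises the expected count per cell (64 cells instead of 256 for the same 200 samples), at the price of losing resolution within each block.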
  4. Apr 10, 2014 #3
    Not really. Perhaps I am just "casting about for some statistical method" that I could use to perform a hypothesis test! All the commonly used tests are simply inapplicable here.

    I basically have a stream of data of the kind (X1,Y1),...,(Xn,Yn) where each "unit" is simply a pair of integers in the range [0,...,15]. If you are familiar with bits and bytes, you can think of this stream as a sequence of bytes, where each pair (X,Y) is given by the lower 4 bits (the variable X) and the higher 4 bits (the variable Y).
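
    The byte layout described above can be sketched as follows. The helper name `nibble_table` and the four example bytes are assumptions made for illustration; rows index Y (high nibble) and columns index X (low nibble), as in the first post.

```python
import numpy as np

def nibble_table(data: bytes) -> np.ndarray:
    """16x16 count table: rows = high nibble (Y), columns = low nibble (X)."""
    b = np.frombuffer(data, dtype=np.uint8)
    x = (b & 0x0F).astype(np.intp)  # lower 4 bits
    y = (b >> 4).astype(np.intp)    # higher 4 bits
    table = np.zeros((16, 16), dtype=int)
    np.add.at(table, (y, x), 1)     # unbuffered add, so repeated pairs are counted
    return table

example = bytes([0x00, 0x1F, 0xF1, 0x1F])
t = nibble_table(example)
```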

    Basically I would like to find statistical evidence that the information contained in X (the lower bits) is correlated with the information in Y (the higher bits). I would prefer to avoid rebinning.

    Currently I am running a simulation with artificial data, and I already know that such data do manifest a correlation between X and Y. It is even quite evident from the SVD of the 16x16 contingency table, which has two very large singular values, while all the others are close to 0. But again, I don't know how to turn this into a decision test.
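
    As a rough illustration of that observation, the sketch below builds a table from hypothetical artificial data (a hidden binary variable selects one of two 8x8 blocks, which is an assumption and not the actual simulation) and inspects its singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Hypothetical dependent data: a hidden bit z pushes both X and Y into
# the range 0..7 or 8..15, so the expected table has rank 2.
z = rng.integers(0, 2, size=n)
x = rng.integers(0, 8, size=n) + 8 * z
y = rng.integers(0, 8, size=n) + 8 * z

table = np.zeros((16, 16))
np.add.at(table, (y, x), 1)

s = np.linalg.svd(table, compute_uv=False)  # singular values, largest first
share = s**2 / np.sum(s**2)                 # share of the squared Frobenius norm
```

    With only 200 samples the remaining singular values are not exactly zero, because sampling noise alone gives the table full rank. That is why a reference distribution under independence is needed before "two large singular values" can become a decision.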
  5. Apr 10, 2014 #4

    Stephen Tashi

    Science Advisor

    Since you have naturally paired observations, look at the Wilcoxon signed-rank test and rank sum test.

    Since you know how to do simulation, another approach is to make up your own statistic and obtain its distribution (under the assumption the null hypothesis is true) by using simulation.

    For example there is a formula that gives how to compute the chi-square statistic from given sample values. Then there are the associated complications of finding what table to use for the distribution of that statistic - such as checking no cells are empty, finding the "degrees of freedom" etc. You don't have to deal with the complications of finding what table to use if you obtain the distribution of the statistic by simulation.

    In Statistics, a "statistic" is defined as some function of the values in a sample. So a statistic can be any function. If you can simulate how the values in a sample are generated, you can obtain a good estimate for the distribution of a statistic. If you have a model for generating the data when the null hypothesis is false, you can determine how effective a statistic is at discriminating between the null hypothesis and the alternative.
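
    One way to sketch this in code is a permutation test: the chi-square statistic is computed as usual, but its null distribution comes from shuffling Y against X instead of from a table. Shuffling is one possible way to simulate independence and is an assumption here, as are the function names and the example data.

```python
import numpy as np

def chi2_stat(x, y, k=16):
    """Pearson chi-square statistic of the k x k table of (x, y) pairs."""
    table = np.zeros((k, k))
    np.add.at(table, (y, x), 1)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = expected > 0  # cells in empty rows/columns contribute nothing
    return np.sum((table[mask] - expected[mask]) ** 2 / expected[mask])

def permutation_pvalue(x, y, n_sim=2000, seed=0):
    rng = np.random.default_rng(seed)
    observed = chi2_stat(x, y)
    # Shuffling y breaks any pairing with x but keeps both marginals fixed.
    sims = np.array([chi2_stat(x, rng.permutation(y)) for _ in range(n_sim)])
    # Add-one correction keeps the estimated p-value away from exactly zero.
    return (1 + np.sum(sims >= observed)) / (n_sim + 1)

rng = np.random.default_rng(42)
x = rng.integers(0, 16, size=200)
# Hypothetical dependent data: Y copies X about half of the time.
y_dep = np.where(rng.random(200) < 0.5, x, rng.integers(0, 16, size=200))
p_dep = permutation_pvalue(x, y_dep)
```

    Zero observed counts cause no trouble here: only the expected counts appear in a denominator, and those vanish only for rows or columns that are empty altogether, which the mask skips.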
  6. Apr 11, 2014 #5
    Hi Stephen,
    thanks for your suggestion: I didn't know anything about Wilcoxon signed-rank test.

    About the simulation part, I didn't understand completely what you meant.
    Are you suggesting that I should simulate a scenario where X and Y are independent and generate very many samples in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) under independence?

    By the way, are you aware of any test of independence that is based on the SVD or on non-negative matrix factorization (NNMF)? As I said, from both the SVD and the NNMF I can see visually that the 16x16 contingency table is reconstructed very faithfully by a rank-2 approximation alone (thus suggesting a dependence). The problem is how to convert this "visual evidence" into a yes/no decision.
  7. Apr 11, 2014 #6

    Stephen Tashi

    Science Advisor

    I'm saying you can implement a simulation where x and y are independent and find the distribution of any scalar-valued function of the data that you want. Let F((x1,y1),(x2,y2),...,(xn,yn)) be any real-valued function of the data that is of interest. You can generate simulated samples and from them create a histogram of the values of F that approximates its distribution. From that histogram, for a given number r, you can determine the probability that F is less than r. Thus you can do a "one-tailed" hypothesis test based on the observed value of F on the actual data. You could also do a "two-tailed" test.
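
    A minimal sketch of this recipe for an arbitrary statistic F. Two assumptions are made purely for illustration: independent samples are drawn from the empirical marginals of the observed data, and F is taken to be the second singular value of the count table. All function names and the example data are hypothetical.

```python
import numpy as np

def count_table(x, y, k=16):
    t = np.zeros((k, k))
    np.add.at(t, (y, x), 1)
    return t

def second_singular_value(x, y):
    # Hypothetical choice of F: under independence the expected table has
    # rank 1, so a large second singular value hints at dependence.
    return np.linalg.svd(count_table(x, y), compute_uv=False)[1]

def simulate_null(F, x, y, n_sim=1000, seed=0, k=16):
    """Values of F on samples where x and y are drawn independently
    from the empirical marginals of the observed data (an assumption)."""
    rng = np.random.default_rng(seed)
    px = np.bincount(x, minlength=k) / len(x)
    py = np.bincount(y, minlength=k) / len(y)
    return np.array([
        F(rng.choice(k, size=len(x), p=px), rng.choice(k, size=len(y), p=py))
        for _ in range(n_sim)
    ])

def upper_tail_pvalue(null_values, observed):
    # One-tailed test: fraction of simulated values at least as large as observed.
    return (1 + np.sum(null_values >= observed)) / (len(null_values) + 1)

rng = np.random.default_rng(7)
z = rng.integers(0, 2, size=200)        # hidden bit creating the dependence
x = rng.integers(0, 8, size=200) + 8 * z
y = rng.integers(0, 8, size=200) + 8 * z
null = simulate_null(second_singular_value, x, y)
p = upper_tail_pvalue(null, second_singular_value(x, y))
```

    The array `null` plays the role of the histogram: any other real-valued F can be passed in without changing the rest of the procedure.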

    I'm not familiar with such tests for "categorical data" (e.g. if your bytes represent things like different medicines or names of models of cars). Things involving matrix multiplications relate to linear transformations of coordinates. For contexts where applying a transformation of coordinates to the data makes sense (e.g. if your bytes represent spatial coordinates) there are methods of data analysis such as "principal components analysis" and "independent components analysis". I assume that there are "well known" statistical tests that accompany these methods - but they aren't "well known" to me.
  8. Apr 14, 2014 #7
    Thanks for the suggestions. I think I got the idea now.

    About the SVD-like techniques (e.g. NNMF), I have seen them applied to categorical data too. One example I came across is this and it is supposedly extracted from an example found in a book. It seems that the more general definition of such statistical techniques goes under the name of "correspondence analysis".
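
    For reference, a minimal sketch of the SVD step at the core of correspondence analysis, assuming the usual formulation through standardized residuals. The function name is hypothetical, and dropping empty rows and columns is a choice made here to avoid division by zero.

```python
import numpy as np

def correspondence_inertias(table):
    """Principal inertias (squared singular values) of a contingency table."""
    table = np.asarray(table, dtype=float)
    # Drop empty rows/columns, which would otherwise cause division by zero.
    table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
    P = table / table.sum()   # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)         # row masses
    c = P.sum(axis=0)         # column masses
    # Standardized residuals from the independence model r * c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

inertias = correspondence_inertias([[10, 0], [0, 10]])
total_inertia = inertias.sum()
```

    The total inertia equals the Pearson chi-square statistic divided by the sample size, so this decomposition splits the same quantity used by the χ2 test into components, and the zeros in individual cells cause no problem.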