Test for independence in contingency tables with zeros

  • Thread starter: mnb96
  • Tags: Independence Test

Summary
The discussion revolves around performing a test for independence in a 16x16 contingency table with many zero entries due to a small sample size of 200. Standard tests like the χ2-test and G-test are inapplicable because of the zeros. Suggestions include using the Wilcoxon signed-rank test or simulating data to estimate the distribution of a custom statistic under the null hypothesis. The conversation also touches on using Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) as potential methods for analyzing the data, with references to correspondence analysis for categorical data. The need for a robust statistical approach to determine dependency between the variables X and Y remains a key focus.
mnb96
Hello,

I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).
Unfortunately, many cells in the table contain zeros. The zeros are due to the fact that the sample size was not large enough to properly populate the table (I had "only" 200 samples, while the table has 256 cells).

In this case I can use neither the χ2-test, nor the χ2-test with Yates's correction, nor the G-test, because I obviously get divisions by zero.

What kind of test can I perform in such situations?
 
mnb96 said:
I have a 16x16 contingency table, and I would like to perform a test in order to reject the hypothesis of independence between the variable X (columns) and the other variable Y (rows).

You'll get the best advice if you describe the real world problem. I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

If you are determined to use bins, you could group the data into coarser bins. For example, put the cases (x,y) = (1,1), (1,2), (2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.
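A minimal sketch of this coarser-binning idea, assuming the counts sit in a 16x16 NumPy array (the array name and the random data below are made up); merging 4x4 blocks of adjacent codes gives a 4x4 table that the ordinary chi-square test can handle, which only makes sense if adjacent codes are genuinely comparable:

Code:
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 16x16 contingency table of counts (rows = Y, columns = X),
# standing in for the real data.
rng = np.random.default_rng(0)
table = rng.multinomial(200, np.ones(256) / 256).reshape(16, 16)

# Collapse each 4x4 block of cells into one coarser cell, giving a 4x4 table.
coarse = table.reshape(4, 4, 4, 4).sum(axis=(1, 3))

# Ordinary chi-square test of independence on the coarser table.
chi2, p, dof, expected = chi2_contingency(coarse)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")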
 
Stephen Tashi said:
I hope when you say "in order to reject" you don't mean that you have set rejection as your goal and that you are casting about for some statistical method to attain it.

Not really. Perhaps I am just "casting about for some statistical method" that I could use to perform a hypothesis test! All the commonly used tests are simply inapplicable here.


If you are determined to use bins, you could group the data into coarser bins. For example, put the cases (x,y) = (1,1), (1,2), (2,1), or (2,2) in the same "bin". Whether this makes sense depends on the details of the problem.

I basically have a stream of data of the kind (X1,Y1),...,(Xn,Yn), where each "unit" is simply a pair of integers in the range [0,...,15]. If you are familiar with bits and bytes, then you can imagine this stream of data as a sequence of bytes, where each pair (X,Y) is given by the lower 4 bits (the variable X) and the higher 4 bits (the variable Y).

Basically I would like to find statistical evidence that the information contained in X (the lower bits) is correlated with the information in Y (the higher bits). I would prefer to avoid rebinning.

Currently I am running a simulation with artificial data, and I already know that these data do manifest a correlation between X and Y. It is even quite evident from the SVD of the 16x16 contingency table, which has two very large singular values while all the others are close to 0. But again, I don't know how to turn this into a decision test.
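For what it's worth, those singular values can be inspected directly; here is a sketch with a made-up table standing in for the real counts. Under exact independence the table of expected frequencies is the outer product of the marginals, hence rank 1, so a clearly non-negligible second singular value is what hints at dependence (though by itself this is not yet a test):

Code:
import numpy as np

# Made-up 16x16 table of counts (rows = Y, columns = X).
rng = np.random.default_rng(1)
table = rng.multinomial(200, np.ones(256) / 256).reshape(16, 16).astype(float)

# Singular values of the observed table.  The frequency table expected under
# independence is rank 1 (outer product of the marginals), so genuine
# dependence tends to show up as additional large singular values.
singular_values = np.linalg.svd(table, compute_uv=False)
print(np.round(singular_values, 2))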
 
Since you have naturally paired observations, look at the Wilcoxon signed-rank test and rank sum test.
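A quick sketch of applying the signed-rank test to the paired stream (the arrays x and y below are invented stand-ins for the real pairs); note that it tests whether the differences Y - X are centred at zero, which is a narrower question than general dependence:

Code:
import numpy as np
from scipy.stats import wilcoxon

# Invented paired stream: 200 observations, each a pair (X_i, Y_i) in 0..15.
rng = np.random.default_rng(2)
x = rng.integers(0, 16, size=200)
y = rng.integers(0, 16, size=200)

# Signed-rank test on the paired differences Y_i - X_i.  With small integers
# many differences are exactly zero; scipy's default zero_method discards them.
stat, p = wilcoxon(x, y)
print(f"W = {stat:.1f}, p = {p:.3f}")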


Since you know how to do simulation, another approach is to make up your own statistic and obtain its distribution (under the assumption the null hypothesis is true) by using simulation.

For example, there is a formula for computing the chi-square statistic from the sample values. Then there are the associated complications of finding what table to use for the distribution of that statistic - such as checking that no cells are empty, finding the "degrees of freedom", etc. You don't have to deal with the complications of finding what table to use if you obtain the distribution of the statistic by simulation.

In Statistics, a "statistic" is defined as some function of the values in a sample. So a statistic can be any function. If you can simulate how the values in a sample are generated, you can obtain a good estimate for the distribution of a statistic. If you have a model for generating the data when the null hypothesis is false, you can determine how effective a statistic is at discriminating between the null hypothesis and the alternative.
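As a hedged sketch of that simulation idea: draw X and Y independently from the observed marginals (one reasonable way to model the null hypothesis), compute your chosen statistic on each simulated table, and collect the values. The function names and the particular statistic below are just placeholders for "any function of the sample":

Code:
import numpy as np

def statistic(table):
    # Any scalar function of the table will do; this placeholder is the squared
    # distance between the table and the table expected under independence.
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2).sum())

def simulate_null(row_p, col_p, n, n_sims=5000, seed=0):
    # Distribution of the statistic when X and Y are drawn independently
    # from the given marginal probabilities (i.e. under the null hypothesis).
    rng = np.random.default_rng(seed)
    values = np.empty(n_sims)
    for k in range(n_sims):
        x = rng.choice(16, size=n, p=col_p)
        y = rng.choice(16, size=n, p=row_p)
        sim_table = np.zeros((16, 16))
        np.add.at(sim_table, (y, x), 1)   # build the simulated 16x16 table
        values[k] = statistic(sim_table)
    return values

The observed value of the statistic on the real table is then compared against this simulated distribution.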
 
Hi Stephen,
thanks for your suggestion: I didn't know anything about the Wilcoxon signed-rank test.

About the simulation part, I didn't completely understand what you meant. Are you suggesting that I should simulate a scenario where X and Y are independent and generate very many samples in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) in the case of independence?

By the way, are you aware of any test of independence that is based on the SVD or on Non-Negative Matrix Factorization? As I said, from both the SVD and the NNMF I can visually see that the 16x16 contingency table can be reconstructed very faithfully with only a rank-2 approximation (thus suggesting a dependency). The problem is how to convert this "visual evidence" into a yes/no decision.
 
mnb96 said:
Are you suggesting that I should simulate a scenario where X and Y are independent and generate very many samples in order to obtain a good estimate of the joint distribution P(X,Y)=P(X)P(Y) in the case of independence?

I'm saying you can implement a simulation where x and y are independent and find the distribution of any scalar-valued function of the data that you want. Let F((x1,y1),(x2,y2),...,(xn,yn)) be any real-valued function of the data that is of interest. You can generate simulated samples and from them create a histogram of the values of F that approximates its distribution. From that histogram, for a given number r, you can determine the probability that F is less than r. Thus you can do a "one-tailed" hypothesis test based on the observed value of F on the actual data. You could also do a "two-tailed" test.
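Concretely, with null_values holding the simulated values of F and f_obs its value on the actual data (both invented names for this sketch), the one-tailed p-value can be read off like this:

Code:
import numpy as np

def one_tailed_p(null_values, f_obs):
    # Estimated P(F >= f_obs | H0); the +1 terms are the usual finite-sample
    # correction for Monte Carlo tests.
    null_values = np.asarray(null_values)
    return (np.sum(null_values >= f_obs) + 1) / (len(null_values) + 1)

A two-tailed version would combine the corresponding probabilities from both tails.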


By the way, are you aware of any test of independence that is based on the SVD or on Non-Negative Matrix Factorization?

I'm not familiar with such tests for "categorical data" (e.g. if your bytes represent things like different medicines or names of car models). Things involving matrix multiplications relate to linear transformations of coordinates. For contexts where applying a transformation of coordinates to the data makes sense (e.g. if your bytes represent spatial coordinates), there are methods of data analysis such as "principal components analysis" and "independent components analysis". I assume that there are "well known" statistical tests that accompany these methods - but they aren't "well known" to me.
 
Thanks for the suggestions. I think I got the idea now.

About the SVD-like techniques (e.g. NNMF), I have seen them applied to categorical data too. One example I came across is this, supposedly taken from an example in a book. It seems that the more general form of such techniques goes under the name of "correspondence analysis".
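In correspondence analysis the link to a test statistic is fairly direct: the squared singular values of the matrix of standardized residuals sum to χ2/n, so the decomposition and the usual statistic are two views of the same quantity. A sketch, assuming table holds the 16x16 counts and no row or column is entirely empty:

Code:
import numpy as np

def correspondence_svd(table):
    # Standardized residuals used in correspondence analysis; their squared
    # singular values sum to chi-square / n (needs nonzero row and column totals).
    table = np.asarray(table, dtype=float)
    n = table.sum()
    P = table / n                          # correspondence matrix
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    return sv, n * (sv ** 2).sum()         # singular values, chi-square value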
 
