# Producing a chi-squared distribution by Monte Carlo

1. Oct 28, 2011

### Whenry

Hi all,

I am using a chi-squared test for independence of two sets of categorical data.

So let's say I have a vector x1 of 1s and 0s and a vector x2 of 1s and 0s, and I am testing to see if x1 and x2 are independent.

and let's say, for my given data, with n = 200, I have

|      | x1=1 | x1=0 |
|------|------|------|
| x2=1 | 40   | 80   |
| x2=0 | 40   | 40   |

For this particular distribution, I get a p value of 0.0184.
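As a quick check of that number, here is a minimal sketch that computes the Pearson chi-squared statistic and its p-value for the table, using only the Python standard library. It assumes no continuity correction and one degree of freedom; the function names are illustrative and not from the post.

```python
import math

# Observed counts: rows are x2 = 1, x2 = 0; columns are x1 = 1, x1 = 0.
observed = [[40, 80],
            [40, 40]]

def chi_square_statistic(table):
    """Pearson chi-squared statistic for a 2x2 table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells whose margin is empty
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def p_value_one_df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(chi2 > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

stat = chi_square_statistic(observed)
p_value = p_value_one_df(stat)
print(round(stat, 4), round(p_value, 4))
```

The statistic comes out at about 5.56 and the p-value at about 0.0184, which agrees with the figure quoted above.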

How can I 'verify' this using a Monte Carlo method?

I tried two ways so far.

First I calculated, from above, P(x2=1|x1=1) = 40/80 = 0.5

I then randomly generated 10000 tables like the one above with p(x1=1) = 0.4 and p(x2=1) = 0.6 (the observed marginal proportions, 80/200 and 120/200).
I then looked for the number of tables which had P(x2=1|x1=1) > 0.5.

This didn't work... and I realized I am not checking for the correct thing. But I am using chi-squared in the first place to see if the conditional probability is 'significant', so shouldn't this tell me something?

I tried another way, in which I generated 10000 tables just as before.

The average of these tables is

|      | x1=1 | x1=0 |
|------|------|------|
| x2=1 | 48   | 72   |
| x2=0 | 32   | 48   |

so I looked for all the tables with

|      | x1=1 | x1=0 |
|------|------|------|
| x2=1 | <40  | >80  |
| x2=0 | >40  | <40  |

Now, one more related question: if I find that I can reject the null hypothesis that x1 and x2 are independent, what do I use to measure the accuracy of the calculated conditional probability?

For example, suppose I have x1 = [zeros(1,998), 1, 1] and x2 = [zeros(1,998), 1, 1].

Then I find that I can reject the null hypothesis, but with what certainty can I say p(x1=1|x2=1) = 1?
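One possible way to quantify this (an assumption, not something proposed in the thread) is an exact binomial confidence interval on the conditional proportion. Only the two cases with x2 = 1 carry information about p(x1=1|x2=1), and both have x1 = 1. When all n trials are successes, the two-sided exact (Clopper-Pearson) lower bound solves p^n = alpha/2. The function name below is illustrative.

```python
def exact_lower_bound_all_successes(n, confidence=0.95):
    """Two-sided Clopper-Pearson lower bound when all n trials are successes."""
    alpha = 1.0 - confidence
    # With n successes out of n, the lower limit p solves p**n = alpha / 2.
    return (alpha / 2.0) ** (1.0 / n)

# Two cases with x2 = 1, both of which have x1 = 1.
lower = exact_lower_bound_all_successes(2)
print(round(lower, 3))
```

With only two cases the 95% interval runs from roughly 0.16 up to 1, so the data are compatible with a conditional probability far below 1 even though the point estimate is exactly 1.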

Will

2. Oct 29, 2011

### Stephen Tashi

This is a good question. I find that writing a simulation (or just thinking about how to do it) clarifies problems in probability.

The simplistic answer is:
1) Compute the value $\chi^2_{data}$ of the chi-squared statistic on the test data you have.
2) Simulate the generation of other test data under the assumption of the "null hypothesis" many times and compute the value of the chi-squared statistic on each of these simulated data sets.
3) Compute what fraction of the chi-squared statistics of the simulated data sets are greater than or equal to $\chi^2_{data}$. See if that is approximately 0.0184.
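The three steps above can be sketched as follows. This is only one reading of "simulate under the null hypothesis": each of the 200 cases draws x1 and x2 independently, with probabilities set to the observed marginal proportions (80/200 and 120/200). That sampling scheme, the number of simulations, the seed, and all function names are assumptions made for illustration.

```python
import random

# Observed counts: rows are x2 = 1, x2 = 0; columns are x1 = 1, x1 = 0.
observed = [[40, 80],
            [40, 40]]

def chi_square_statistic(table):
    """Pearson chi-squared statistic, with expected counts taken from the table's own margins."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells whose margin is empty
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def simulate_independent_table(n, p_x1, p_x2, rng):
    """Draw n cases in which x1 and x2 are generated independently of each other."""
    table = [[0, 0], [0, 0]]
    for _ in range(n):
        x1 = 1 if rng.random() < p_x1 else 0
        x2 = 1 if rng.random() < p_x2 else 0
        table[1 - x2][1 - x1] += 1  # row 0 is x2 = 1, column 0 is x1 = 1
    return table

def monte_carlo_p_value(observed, n_sims=5000, seed=0):
    """Fraction of simulated statistics at least as large as the observed one."""
    rng = random.Random(seed)
    n = sum(sum(row) for row in observed)
    p_x1 = (observed[0][0] + observed[1][0]) / n  # 80/200
    p_x2 = (observed[0][0] + observed[0][1]) / n  # 120/200
    chi_data = chi_square_statistic(observed)     # step 1
    hits = 0
    for _ in range(n_sims):                       # step 2
        sim = simulate_independent_table(n, p_x1, p_x2, rng)
        # Small tolerance so floating-point ties count as "greater than or equal".
        if chi_square_statistic(sim) >= chi_data - 1e-9:
            hits += 1
    return hits / n_sims                          # step 3

estimate = monte_carlo_p_value(observed)
print(estimate)
```

The estimate varies with the seed and the number of simulations, so it should be compared with 0.0184 only approximately.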

The difficulty is determining what it means to simulate the test data under the assumption of the null hypothesis. Your examples involve assumptions about conditional probability. I don't know exactly how you implemented these assumptions.
Nor do I profess to remember exactly how people who do chi-squared tests state the null hypothesis if they are pressed to state it precisely, so all I can tell you is my guess about this.

For the data table:
|        | X1 = 1 | X1 = 0 |
|--------|--------|--------|
| X2 = 1 | 40     | 80     |
| X2 = 0 | 40     | 40     |

I think the null hypothesis assumes that X1 will be 1 in 80 cases, that X1 will be 0 in 120 cases, that X2 will be 1 in 120 cases and that X2 will be 0 in 80 cases. Subject to those constraints, we randomly assign the cases to the 4 cells in the data table.

This is different from assuming probabilities such as P(X1 = 1). If we said P(X1 = 1) = 80/200, then in simulating 200 cases we might not get exactly 80 cases where X1 = 1.

In expositions of the chi-squared test, one often reads that the null hypothesis is merely about the independence of two events. If that were true, then all it would tell us in this problem is things like P(X1=1 and X2=1) = P(X1=1)P(X2=1), and it wouldn't tell us the values of these probabilities. I think the proper statement of the null hypothesis is that we assume that the "marginal frequencies" of the data table come out as observed. For example, we assume X1 will be 1 in 80 cases. This is a much more restrictive assumption than assuming that the marginal frequencies are equal to the true probabilities of events such as X1=1.

So, how shall we fill the cells of the table "randomly" but still make it come out so the marginal totals are fixed?
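One common way to do this (offered here as an assumption, not as the thread's answer) is to keep the list of X1 values fixed and randomly shuffle the X2 values against it. Every shuffled data set then has exactly 80 cases with X1 = 1 and 120 cases with X2 = 1, and only the cell counts vary. The helper names, the number of simulations, and the seed are illustrative.

```python
import random

# Observed counts: rows are X2 = 1, X2 = 0; columns are X1 = 1, X1 = 0.
observed = [[40, 80],
            [40, 40]]

def chi_square_statistic(table):
    """Pearson chi-squared statistic, with expected counts from the table's margins."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells whose margin is empty
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def table_from_vectors(x1, x2):
    """Cross-tabulate paired 0/1 vectors: row 0 is X2 = 1, column 0 is X1 = 1."""
    table = [[0, 0], [0, 0]]
    for a, b in zip(x1, x2):
        table[1 - b][1 - a] += 1
    return table

def fixed_margin_p_value(observed, n_sims=5000, seed=0):
    """Monte Carlo p-value with both sets of marginal totals held at their observed values."""
    rng = random.Random(seed)
    n = sum(sum(row) for row in observed)
    n_x1_one = observed[0][0] + observed[1][0]  # 80 cases with X1 = 1
    n_x2_one = observed[0][0] + observed[0][1]  # 120 cases with X2 = 1
    x1 = [1] * n_x1_one + [0] * (n - n_x1_one)
    x2 = [1] * n_x2_one + [0] * (n - n_x2_one)
    chi_data = chi_square_statistic(observed)
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(x2)  # re-pair the X2 values with the X1 values at random
        sim = table_from_vectors(x1, x2)
        # Small tolerance so floating-point ties count as "greater than or equal".
        if chi_square_statistic(sim) >= chi_data - 1e-9:
            hits += 1
    return hits / n_sims

fixed_estimate = fixed_margin_p_value(observed)
print(fixed_estimate)
```

With the margins fixed, the statistic can take only a limited set of discrete values, so this estimate need not coincide exactly with the 0.0184 obtained from the continuous chi-squared distribution.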

Also, the chi-squared statistic requires that we know the expected frequency for each cell in the table. What are the expected frequencies for each cell?
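Under the usual textbook convention (an assumption here, since the post leaves the question open), the expected frequency of a cell is its row total times its column total divided by the number of cases. Writing $R_1 = 120$, $R_2 = 80$ for the row totals (X2 = 1, 0), $C_1 = 80$, $C_2 = 120$ for the column totals (X1 = 1, 0), and $n = 200$:

$$
E_{ij} = \frac{R_i C_j}{n}, \qquad
E_{11} = \frac{120 \cdot 80}{200} = 48, \quad
E_{12} = \frac{120 \cdot 120}{200} = 72, \quad
E_{21} = \frac{80 \cdot 80}{200} = 32, \quad
E_{22} = \frac{80 \cdot 120}{200} = 48
$$

With observed counts $O_{ij}$, every cell differs from its expected frequency by 8 in absolute value, so the statistic for the observed table is

$$
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
= \frac{64}{48} + \frac{64}{72} + \frac{64}{32} + \frac{64}{48}
= \frac{50}{9} \approx 5.56
$$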

Last edited: Oct 29, 2011