Producing chi-squared distribution by Monte Carlo

Whenry · Oct 28, 2011

Hi all,

I am using a chi-squared test for independence of two sets of categorical data.

So let's say I have a vector x1 of 1s and 0s and x2 of 1s and 0s, and I am testing to see if x1 and x2 are independent.

and let's say, for my given data, with n = 200, I have

        x1=1   x1=0
x2=1     40     80
x2=0     40     40

For this particular distribution, I get a p value of 0.0184.
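
For reference, the quoted p-value can be reproduced with a short sketch. It assumes the plain Pearson chi-squared statistic without a continuity correction and one degree of freedom (the usual choice for a 2x2 table); the function names are made up for illustration.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(Z^2 > s) = erfc(sqrt(s / 2)) with Z standard normal.
    return math.erfc(math.sqrt(stat / 2.0))

# Rows: x2=1, x2=0. Columns: x1=1, x1=0.
observed = [[40, 80], [40, 40]]
stat = chi_square_2x2(observed)
p = p_value_df1(stat)
print(round(stat, 4), round(p, 4))
```

The statistic works out to 50/9 (about 5.56), and the tail probability rounds to the 0.0184 given above.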

How can I 'verify' this using a Monte Carlo method?

I tried two ways so far.

First I calculated, from the table above, P(x2=1|x1=1) = 40/80 = 0.5

I then randomly generated 10000 tables like the one above with p(x1=1) = 0.4 and p(x2=1) = 0.6 (the marginal proportions of the table above).
I then counted the number of tables which had P(x2=1|x1=1) > 0.5.

This didn't work... and I realized that I am not checking for the correct thing. But I am using chi-squared in the first place to see if the conditional probability is 'significant', so this should tell me something?

I tried another way, in which I generated 10000 tables like the one above, just as before.

The average of these tables is

48 72
32 48

so I looked for all the tables with

<40 >80
>40 <40

Now, one more related question: if I find that I can reject the null hypothesis that x1 and x2 are independent, what do I use to measure the accuracy of the calculated conditional probability?

For example if I have x1 = [ zeros(1,998), 1,1] and x2 = [ zeros(1,998), 1,1] .

Then I find that I can reject the null hypothesis, but with what certainty can I say p(x1|x2) = 1?
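
One hedged way to put a number on this is an exact (Clopper-Pearson style) lower confidence bound on the conditional probability. The sketch below assumes the conditional probability is estimated only from the cases with x2 = 1 and that a one-sided 95% level is wanted; both are illustrative choices, not something fixed by the question. When all n conditioning cases are successes, the bound is alpha^(1/n), because p^n >= alpha is the condition for "all successes" to remain plausible.

```python
# Python version of the vectors from the example above
x1 = [0] * 998 + [1, 1]
x2 = [0] * 998 + [1, 1]

# Only the cases with x2 = 1 carry information about P(x1=1 | x2=1)
n_cond = sum(x2)
successes = sum(a for a, b in zip(x1, x2) if b == 1)
estimate = successes / n_cond

alpha = 0.05  # assumed one-sided significance level
# Exact lower bound when every conditioning case is a success
lower_bound = alpha ** (1.0 / n_cond) if successes == n_cond else None

print(n_cond, estimate, lower_bound)
```

With only two cases where x2 = 1, the point estimate is 1 but the lower bound is only about 0.22, so the data say very little about the conditional probability. Note also that the expected count in the (1,1) cell is 2*2/1000 = 0.004, far too small for the chi-squared approximation to be trusted here.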

Will

Stephen Tashi · Oct 29, 2011

Whenry said:

For this particular distribution, I get a p value of 0.0184.

How can I 'verify' this using a Monte Carlo method?

This is a good question. I find that writing a simulation (or just thinking about how to do it) clarifies problems in probability.

The simplistic answer is:
1) Compute the value \chi_{data} of the chi-squared statistic on the test data you have.
2) Simulate the generation of other test data under the assumption of the "null hypothesis" many times and compute the value of the chi-squared statistic on each of these simulated data sets.
3) Compute what fraction of the chi-squared statistics of the simulated data sets are greater than or equal to \chi_{data}. See if that is approximately 0.0184.
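
The three steps can be sketched as follows. This is only one possible reading of "the null hypothesis": it assumes x1 and x2 are drawn independently with P(x1=1) and P(x2=1) set to the observed marginal proportions, which is an assumption of the sketch rather than something the thread settles. The function names, the seed, and the number of simulations are all illustrative.

```python
import random

def chi_square_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table, using the table's own margins."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected > 0:  # skip cells in a degenerate (all-zero) row or column
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def simulate_table(n, p_x1, p_x2, rng):
    """Draw n independent (x1, x2) pairs and cross-tabulate them."""
    table = [[0, 0], [0, 0]]  # rows: x2=1, x2=0; columns: x1=1, x1=0
    for _ in range(n):
        x1 = 1 if rng.random() < p_x1 else 0
        x2 = 1 if rng.random() < p_x2 else 0
        table[1 - x2][1 - x1] += 1
    return table

def monte_carlo_p_value(observed, n_sims=10000, seed=0):
    rng = random.Random(seed)
    n = sum(sum(r) for r in observed)
    p_x1 = (observed[0][0] + observed[1][0]) / n  # observed proportion with x1 = 1
    p_x2 = (observed[0][0] + observed[0][1]) / n  # observed proportion with x2 = 1
    stat_data = chi_square_2x2(observed)          # step 1
    hits = 0
    for _ in range(n_sims):                       # step 2
        stat_sim = chi_square_2x2(simulate_table(n, p_x1, p_x2, rng))
        if stat_sim >= stat_data - 1e-9:          # small tolerance for float ties
            hits += 1
    return hits / n_sims                          # step 3

p_mc = monte_carlo_p_value([[40, 80], [40, 40]])
print(p_mc)
```

The estimate should land in the neighbourhood of the asymptotic value but need not match it exactly, since the chi-squared distribution is only an approximation for a table of finite size.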

The difficulty is determining what it means to simulate the test data under the assumption of the null hypothesis. Your examples involve assumptions about conditional probability. I don't know exactly how you implemented these assumptions.
Nor do I profess to remember exactly how people who do chi-squared tests state the null hypothesis if they are pressed to state it precisely, so all I can tell you is my guess about this.

For the data table:
          X1 = 1   X1 = 0
X2 = 1      40       80
X2 = 0      40       40

I think the null hypothesis assumes that X1 will be 1 in 80 cases, that X1 will be 0 in 120 cases, that X2 will be 1 in 120 cases and that X2 will be 0 in 80 cases. Subject to those constraints, we randomly assign the cases to the 4 cells in the data table.

This is different from assuming probabilities such as P(X1 = 1). If we said P(X1 = 1) = 80/200, then in simulating 200 cases we might not get exactly 80 cases where X1 = 1.
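
A quick sketch of that point, assuming each of the 200 cases is an independent draw with probability 80/200 (the seed and the number of repetitions are arbitrary):

```python
import random

rng = random.Random(1)  # fixed seed, chosen arbitrarily
n, p = 200, 80 / 200

# Number of cases with X1 = 1 in each of 1000 simulated data sets
counts = [sum(rng.random() < p for _ in range(n)) for _ in range(1000)]

# Fraction of simulated data sets that hit the observed count of 80 exactly
frac_exact = sum(c == 80 for c in counts) / len(counts)

print(min(counts), max(counts), frac_exact)
```

The count of X1 = 1 cases scatters around 80 and only occasionally equals it, so the simulated tables do not share the marginal totals of the observed table.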

In expositions of the chi-squared test, one often reads that the null hypothesis is merely about the independence of two events. If that were true, then all it would tell us in this problem is things like P(X1=1 and X2=1) = P(X1=1)P(X2=1), and it wouldn't tell us the values of these probabilities. I think the proper statement of the null hypothesis is that we assume that the "marginal frequencies" of the data table come out as observed. For example, we assume X1 will be 1 in 80 cases. This is a much more restrictive assumption than assuming that the marginal frequencies are equal to the true probabilities of events such as X1=1.

So, how shall we fill the cells of the table "randomly" but still make it come out so the marginal totals are fixed?
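
One way to do this, offered as a sketch rather than as the definitive answer, is a permutation approach: keep the 80 ones and 120 zeros of X1 in place, shuffle the 120 ones and 80 zeros of X2 against them, and cross-tabulate. Every shuffled table then has exactly the observed marginal totals. The helper names, the seed, and the number of shuffles are illustrative.

```python
import random

def chi_square_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

def shuffled_table(x1, x2, rng):
    """Randomly re-pair the x2 labels with the x1 labels; margins stay fixed."""
    x2_perm = x2[:]
    rng.shuffle(x2_perm)
    table = [[0, 0], [0, 0]]  # rows: X2=1, X2=0; columns: X1=1, X1=0
    for a, b in zip(x1, x2_perm):
        table[1 - b][1 - a] += 1
    return table

rng = random.Random(0)
x1 = [1] * 80 + [0] * 120   # X1 = 1 in 80 cases, 0 in 120 cases
x2 = [1] * 120 + [0] * 80   # X2 = 1 in 120 cases, 0 in 80 cases

stat_data = chi_square_2x2([[40, 80], [40, 40]])
n_sims = 10000
tables = [shuffled_table(x1, x2, rng) for _ in range(n_sims)]
hits = sum(chi_square_2x2(t) >= stat_data - 1e-9 for t in tables)
p_perm = hits / n_sims
print(p_perm)
```

Because the margins are fixed, the table is determined by a single cell, so the simulated statistic takes only a handful of distinct values. The resulting fraction therefore need not agree closely with the p-value from the continuous chi-squared approximation.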

Also, the chi-squared statistic requires that we know the expected frequency for each cell in the table. What are the expected frequencies for each cell?
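
As a worked sketch under the fixed-margins reading above: the usual expected count for a cell is its row total times its column total divided by the grand total. The symbols R_i, C_j, N, E_{ij} and O_{ij} are notation introduced here for illustration, not taken from the thread.

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad N = 200,\; R_1 = 120,\; R_2 = 80,\; C_1 = 80,\; C_2 = 120

E_{11} = \frac{120 \cdot 80}{200} = 48, \quad
E_{12} = \frac{120 \cdot 120}{200} = 72, \quad
E_{21} = \frac{80 \cdot 80}{200} = 32, \quad
E_{22} = \frac{80 \cdot 120}{200} = 48

\chi_{data} = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
            = \frac{64}{48} + \frac{64}{72} + \frac{64}{32} + \frac{64}{48}
            = \frac{50}{9} \approx 5.56
```

These expected counts agree with the average table (48, 72, 32, 48) reported in the original post.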
