Producing chi-squared distribution by Monte Carlo

SUMMARY

This discussion focuses on using the Monte Carlo method to verify the results of a Chi-squared test for independence between two categorical variables, x1 and x2, with a sample size of n = 200. The user initially calculated a p-value of 0.0184 and attempted to simulate data to compare conditional probabilities. The correct approach involves computing the Chi-squared statistic from the observed data, simulating data under the null hypothesis, and determining the fraction of simulated statistics that exceed the observed statistic. Key insights include the importance of maintaining fixed marginal totals when generating simulated data.

PREREQUISITES
  • Understanding of Chi-squared tests for independence
  • Familiarity with Monte Carlo simulation techniques
  • Knowledge of conditional probability concepts
  • Proficiency in statistical programming (e.g., R, Python)
NEXT STEPS
  • Learn how to implement Monte Carlo simulations for statistical testing
  • Study the assumptions and calculations involved in Chi-squared tests
  • Explore methods for generating random data while maintaining fixed marginal totals
  • Investigate techniques for estimating confidence intervals for conditional probabilities
USEFUL FOR

Statisticians, data analysts, and researchers involved in categorical data analysis and hypothesis testing will benefit from this discussion.

Whenry
Hi all,

I am using a chi-squared test for independence of two sets of categorical data.

So let's say I have a vector x1 of 1s and 0s and x2 of 1s and 0s, and I am testing to see if x1 and x2 are independent.

and let's say, for my given data, with n = 200, I have

        x1=1   x1=0
x2=1     40     80
x2=0     40     40


For this particular distribution, I get a p value of 0.0184.

How can I 'verify' this using a Monte Carlo method?

I tried two ways so far.

First, from the table above, I calculated P(x2=1|x1=1) = 40/80 = 0.5

I then randomly generated 10000 tables like the one above with p(x1=1) = 0.4 and p(x2=1) = 0.6.
I then counted the number of tables which had P(x2=1|x1=1) > 0.5.

This didn't work... and I realized that I am not checking for the correct thing. But I am using chi-squared in the first place to see if the conditional probability is 'significant', so this should tell me something?

I tried another way, in which I generated 10000 tables like the one above, just as before.

The average of these tables is

48 72
32 48

so I looked for all the tables with

<40 >80
>40 <40


Now, one more related question: if I find that I can reject the null hypothesis that x1 and x2 are independent, what do I use to measure the accuracy of the calculated conditional probability?

For example if I have x1 = [ zeros(1,998), 1,1] and x2 = [ zeros(1,998), 1,1] .

Then I find that I can reject the null hypothesis, but with what certainty can I say p(x1|x2) = 1?

Will
 
Whenry said:
For this particular distribution, I get a p value of 0.0184.

How can I 'verify' this using monte carlo method?

This is a good question. I find that writing a simulation (or just thinking about how to do it) clarifies problems in probability.

The simplistic answer is:
1) Compute the value \chi_{data} of the chi-square statistic on the test data you have.
2) Simulate the generation of other test data under the assumption of the "null hypothesis" many times, and compute the value of the chi-square statistic on each of these simulated data sets.
3) Compute what fraction of the chi-square statistics of the simulated data sets are greater than or equal to \chi_{data}. See if that is approximately 0.0184.
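For step 1, and for the number that step 3 is supposed to reproduce, here is a minimal Python sketch. The function names are my own, and the expected counts use the usual (row total)(column total)/n convention, which is an assumption on my part. For a 2x2 table the statistic has 1 degree of freedom, so the tail probability can be written with the complementary error function.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Assumed convention: expected count under independence
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_1df(stat):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Rows are x2=1, x2=0; columns are x1=1, x1=0
observed = [[40, 80], [40, 40]]
chi_data = chi_square(observed)   # about 5.556
p_asym = p_value_1df(chi_data)    # about 0.0184
print(chi_data, p_asym)
```

This only reproduces the textbook p-value. It says nothing yet about how to do the simulation in step 2.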


The difficulty is determining what it means to simulate the test data under the assumption of the null hypothesis. Your examples involve assumptions about conditional probability. I don't know exactly how you implemented these assumptions.
Nor do I profess to remember exactly how people who do chi-squared tests state the null hypothesis if they are pressed to state it precisely, so all I can tell you is my guess about this.

For the data table:
          X1=1   X1=0
X2 = 1     40     80
X2 = 0     40     40

I think the null hypothesis assumes that X1 will be 1 in 80 cases, that X1 will be 0 in 120 cases, that X2 will be 1 in 120 cases and that X2 will be 0 in 80 cases. Subject to those constraints, we randomly assign the cases to the 4 cells in the data table.

This is different from assuming probabilities such as P(X1=1). If we said P(X1=1) = 80/200, then in simulating 200 cases we might not get exactly 80 cases where X1 = 1.
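To see the difference concretely, here is a small Python sketch. The seed, the number of repetitions, and the helper name are arbitrary choices of mine. It draws 200 cases with P(X1=1) = 80/200 many times and records how many cases have X1 = 1 each time.

```python
import random

rng = random.Random(1)  # fixed seed only so the run is repeatable

def count_x1_ones(n_cases=200, p=80 / 200):
    """Number of cases with X1 = 1 when each case is drawn independently."""
    return sum(rng.random() < p for _ in range(n_cases))

counts = [count_x1_ones() for _ in range(1000)]

# The count scatters around 80 instead of being pinned at 80
print(min(counts), max(counts), sum(counts) / len(counts))
```

The marginal total for X1 = 1 wanders from run to run, which is exactly what the fixed-marginals reading of the null hypothesis does not allow.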

In expositions of the chi-square test, one often reads that the null hypothesis is merely about the independence of two events. If that were true, then all it would tell us in this problem is things like P(X1=1 and X2=1) = P(X1=1)P(X2=1), and it wouldn't tell us the values of these probabilities. I think the proper statement of the null hypothesis is that we assume that the "marginal frequencies" of the data table come out as observed. For example, we assume X1 will be 1 in 80 cases. This is a much more restrictive assumption than assuming that the marginal frequencies are equal to the true probabilities of events such as X1=1.

So, how shall we fill the cells of the table "randomly" but still make it come out so the marginal totals are fixed?
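One possible way to do this (my own suggestion, not the only one) is to keep the X1 labels in place and shuffle the X2 labels among the 200 cases. Every shuffled data set then has exactly the observed marginal totals. The Python sketch below uses hypothetical function names and an arbitrary seed and number of simulations. It combines the shuffle with the three steps listed earlier.

```python
import random

def chi_square(table):
    """Pearson chi-square statistic, expected = row total * column total / n."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))

def table_from_labels(x1, x2):
    """2x2 table with rows x2=1, x2=0 and columns x1=1, x1=0."""
    table = [[0, 0], [0, 0]]
    for a, b in zip(x1, x2):
        table[1 - b][1 - a] += 1
    return table

def permutation_p_value(observed, n_sim=2000, seed=0):
    """Fraction of fixed-margin shuffles whose statistic is >= the observed one."""
    rng = random.Random(seed)
    row_totals = [sum(r) for r in observed]        # x2=1, x2=0
    col_totals = [sum(c) for c in zip(*observed)]  # x1=1, x1=0
    x1 = [1] * col_totals[0] + [0] * col_totals[1]
    x2 = [1] * row_totals[0] + [0] * row_totals[1]
    chi_data = chi_square(observed)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(x2)  # reassigns cases to cells, margins stay fixed
        # small tolerance guards against floating-point ties
        if chi_square(table_from_labels(x1, x2)) >= chi_data - 1e-9:
            hits += 1
    return hits / n_sim

observed = [[40, 80], [40, 40]]
p_mc = permutation_p_value(observed)
print(p_mc)
```

Because the cell counts are discrete, the simulated fraction should be near 0.0184 but need not match it exactly. The quoted p-value comes from a continuous approximation.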

Also, the chi-squared statistic requires that we know the expected frequency for each cell in the table. What are the expected frequencies for each cell?
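If we take the usual convention (an assumption on my part, but it agrees with the average table 48, 72, 32, 48 in the original post), the expected frequency of a cell is its row total times its column total divided by n. With R_i for the row totals (X2=1: 120, X2=0: 80) and C_j for the column totals (X1=1: 80, X1=0: 120), the worked arithmetic is:

```latex
E_{ij} = \frac{R_i C_j}{n}, \qquad
E_{11} = \frac{120 \cdot 80}{200} = 48, \quad
E_{12} = \frac{120 \cdot 120}{200} = 72, \quad
E_{21} = \frac{80 \cdot 80}{200} = 32, \quad
E_{22} = \frac{80 \cdot 120}{200} = 48
```

With the marginal totals held fixed, these expected frequencies are the same for every simulated table, so only the observed cell counts change from one simulation to the next.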
 