Chi-square test: why does it follow a Chi-square distribution

In summary, the Chi-square statistic computed from observed and expected distributions can be interpreted as twice the second-order Taylor approximation of the Kullback-Leibler divergence. To use it in a hypothesis test, however, we also need to know its sampling distribution. The individual terms [itex]\frac{(O_i-E_i)^2}{E_i}[/itex] are nonnegative and so cannot be normally distributed. Rather, if X is a binomial random variable with parameters n and p, the standardized variable [itex]\frac{X-np}{\sqrt{np(1-p)}}[/itex] is approximately N(0,1) for large n. Relying on this amounts to assuming that the counts in the contingency table follow a multinomial distribution. Without that assumption, the calculated [itex]\chi^2[/itex] value need not follow a [itex]\chi^2[/itex] distribution.
  • #1
mnb96
Hello,

It is well known that the Chi-square test between an observed distribution O and an expected distribution E can be interpreted as a test based on (twice) the second-order Taylor approximation of the Kullback-Leibler divergence, i.e.: [tex]2\,\mathcal{D}_{KL}(O \| E) \approx \sum_i \frac{(O_i-E_i)^2}{E_i} = \chi^2[/tex]
where i indexes the bins of the histogram (or the cells of the contingency table). A proof is given here (page 5).
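The approximation above can be checked numerically. The following is a minimal sketch, not part of the original post: the distributions and the helper names `kl_divergence` and `pearson_chi2` are made up for illustration, and O and E are assumed to be probability vectors that both sum to 1.

```python
import math

def kl_divergence(o, e):
    """D_KL(O || E) for two discrete distributions given as probability lists."""
    return sum(oi * math.log(oi / ei) for oi, ei in zip(o, e) if oi > 0)

def pearson_chi2(o, e):
    """Sum of (O_i - E_i)^2 / E_i over all bins."""
    return sum((oi - ei) ** 2 / ei for oi, ei in zip(o, e))

expected = [0.2, 0.3, 0.5]  # hypothetical expected distribution
for eps in (0.1, 0.01, 0.001):
    # The perturbation sums to zero, so `observed` still sums to 1.
    observed = [0.2 + eps, 0.3 - 2 * eps, 0.5 + eps]
    two_kl = 2 * kl_divergence(observed, expected)
    chi2 = pearson_chi2(observed, expected)
    print(f"eps={eps}: 2*D_KL={two_kl:.8f}  chi2={chi2:.8f}  ratio={two_kl / chi2:.4f}")
```

The ratio should approach 1 as O gets closer to E, which is what a second-order Taylor approximation predicts. The agreement says nothing yet about the sampling distribution of either quantity.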

The question is: how do we know that each of the error terms [itex]\frac{(O_i-E_i)^2}{E_i}[/itex] on the right side of the above equation follows a normal distribution N(0,1)? There is probably some assumption to be made...?
 
  • #2
mnb96 said:
The question is: how do we know that each of the error terms [itex]\frac{(O_i-E_i)^2}{E_i}[/itex] on the right side of the above equation follows a normal distribution N(0,1)?

[itex] \frac{ (O_i - E_i)^2}{E_i} [/itex] is nonnegative, so it doesn't follow a normal distribution.

If [itex] X [/itex] is a binomial random variable representing the number of "successes" in [itex] n [/itex] independent trials with probability of success [itex] p [/itex] on each trial, then the distribution of [itex]Y = \frac {X-np}{\sqrt{np(1-p)}} [/itex] can be approximated by a [itex] N(0,1) [/itex] distribution.
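A rough way to see how good this approximation is: compare the exact CDF of the standardized binomial variable with the standard normal CDF. This is only a sketch using the Python standard library; the value p = 0.3, the sample sizes and the probe points are arbitrary choices, not taken from the thread.

```python
import math

def binom_cdf_standardized(n, p, y):
    """P(Y <= y) where Y = (X - n p) / sqrt(n p (1 - p)) and X ~ Binomial(n, p)."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if (k - mean) / sd <= y
    )

def normal_cdf(y):
    """CDF of the N(0, 1) distribution."""
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

p = 0.3  # arbitrary success probability
for n in (10, 100, 1000):
    gap = max(
        abs(binom_cdf_standardized(n, p, y) - normal_cdf(y))
        for y in (-1.5, -0.5, 0.5, 1.5)
    )
    print(f"n={n}: largest CDF gap at the probed points = {gap:.4f}")
```

The gap is expected to shrink roughly like 1/sqrt(n), which is why the approximation is usually only trusted for reasonably large samples.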
 
  • #3
Stephen Tashi said:
If [itex] X [/itex] is a binomial random variable ...

I see. There is our assumption!
It seems to me that such an assumption automatically implies that the data in the cells of the contingency table are assumed to follow a multinomial distribution.

So in the end, although the formula for calculating the [itex]\chi^2[/itex] value is just an approximation of the Kullback-Leibler divergence, if we want to perform a hypothesis test we still need the assumption that we are dealing with a multinomial distribution. Otherwise the [itex]\chi^2[/itex] value that we calculated according to the formula above does not necessarily follow a [itex]\chi^2[/itex] distribution.
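This conclusion can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not something from the thread: it draws multinomial counts for three hypothetical categories, computes the statistic against the true expected counts, and records how often it exceeds the 5% critical value. Three categories are used because the chi-square distribution with 2 degrees of freedom has the closed-form tail exp(-x/2), so no statistics library is needed.

```python
import math
import random

def pearson_chi2(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_rejection_rate(probs, n, trials, alpha=0.05, seed=0):
    """Fraction of multinomial samples whose statistic exceeds the critical value.

    Restricted to 3 categories: with 2 degrees of freedom the chi-square
    tail probability is exp(-x / 2), so the critical value is -2 * ln(alpha).
    """
    assert len(probs) == 3
    rng = random.Random(seed)
    critical = -2 * math.log(alpha)
    expected = [n * p for p in probs]
    rejections = 0
    for _ in range(trials):
        # n independent draws with fixed probabilities give multinomial counts.
        draws = rng.choices(range(3), weights=probs, k=n)
        observed = [draws.count(c) for c in range(3)]
        if pearson_chi2(observed, expected) > critical:
            rejections += 1
    return rejections / trials

rate = simulate_rejection_rate([0.2, 0.3, 0.5], n=200, trials=5000)
print(f"rejection rate under the null hypothesis: {rate:.3f} (nominal 0.05)")
```

If the counts really are multinomial, the rejection rate should land close to the nominal 5%. Counts generated by some other mechanism (for example, strongly dependent observations) would give no such guarantee.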
 

What is a Chi-square distribution?

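A Chi-square distribution with k degrees of freedom is the probability distribution of the sum of the squares of k independent standard normal random variables. It is often used as the reference distribution for tests on categorical data, such as tests of goodness of fit or of the independence of two categorical variables.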

Why is the Chi-square test used?

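The Chi-square test is used to determine whether observed category counts are consistent with a hypothesized distribution, or whether there is a statistically significant association between two categorical variables. In the second case it helps scientists to judge whether or not the variables are independent of each other.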
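As an illustration of the independence case, here is a minimal sketch for a table of counts. The 2 x 2 table is invented for the example, and the function name `chi2_independence` is hypothetical. Expected counts are computed from the row and column totals, and the p-value uses the fact that for 1 degree of freedom P(chi2 > x) = erfc(sqrt(x / 2)).

```python
import math

def chi2_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

table = [[30, 20], [20, 30]]  # hypothetical counts
stat, dof = chi2_independence(table)
# Closed-form tail probability, valid only for 1 degree of freedom.
p_value = math.erfc(math.sqrt(stat / 2))
print(f"chi2={stat:.3f}, dof={dof}, p={p_value:.4f}")
```

For tables larger than 2 x 2 the degrees of freedom exceed 1, and the p-value would have to come from a chi-square table or a statistics library instead of the `erfc` shortcut.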

How does the Chi-square test work?

The Chi-square test works by comparing the observed frequencies of different categories to the frequencies expected under the null hypothesis. The resulting Chi-square value is then compared to a critical value of the Chi-square distribution with the appropriate number of degrees of freedom to decide whether the deviation is statistically significant.
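The procedure can be sketched for a goodness-of-fit test as follows. The die-roll counts are hypothetical, and the critical value 11.070 is the tabulated 95th percentile of the Chi-square distribution with 5 degrees of freedom (6 categories minus 1).

```python
def pearson_chi2(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [8, 9, 12, 11, 6, 14]        # hypothetical counts from 60 rolls of a die
expected = [sum(observed) / 6] * 6      # a fair die gives 10 expected rolls per face

stat = pearson_chi2(observed, expected)

# Tabulated critical value for 5 degrees of freedom at the 5% level.
CRITICAL_5_DOF_ALPHA_05 = 11.070
reject = stat > CRITICAL_5_DOF_ALPHA_05

print(f"chi2 = {stat:.2f}, reject fairness at 5% level: {reject}")
```

Here the statistic is 4.2, well below the critical value, so these counts give no reason to reject the hypothesis of a fair die.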

Why does the Chi-square test follow a Chi-square distribution?

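The Chi-square test statistic follows a Chi-square distribution only approximately. Under the null hypothesis the cell counts are multinomial, so for large samples each scaled deviation [itex]\frac{O_i-E_i}{\sqrt{E_i}}[/itex] is approximately normally distributed. The statistic is the sum of their squares, and because the counts are constrained to add up to the sample size, that sum is approximately Chi-square distributed with k - 1 degrees of freedom for k categories (when the expected frequencies are fully specified in advance). This allows scientists to use critical values from the Chi-square distribution to determine the significance of their results.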

What are the assumptions for using the Chi-square test?

The main assumptions for using the Chi-square test include independent observations, a large enough sample size, and, as a common rule of thumb, expected frequencies of at least 5 in each category. It is also important that the data come from a random sample and that each observation falls into exactly one category, so that the counts can be treated as multinomial.
