Question on Pearson's Chi-squared test

mnb96 · Feb 3, 2014

Hello,

I was trying to interpret the formula of Pearson's Chi-squared test:
[tex]\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}[/tex]

I thought that if we assume that each [itex]O_i[/itex] is an observation of the random variable [itex]X_i[/itex], then the above formula is essentially the sum of squares of n standardized random variables [itex]Y_i=\frac{X_i-\mu_i}{\sigma_i}[/itex]. Indeed, if these random variables are independent with [itex]Y_i \sim N(0,1)[/itex], then the random variable [itex]S = \sum_{i=1}^n Y_i^2[/itex] follows a [itex]\chi^2[/itex]-distribution. Thus, the Chi-squared test would essentially evaluate the tail probability [itex]\mathrm{P}\left( S \geq \chi^2 \right)[/itex] and compare it to some chosen significance level.
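As a minimal sketch of the reading above, the following Python snippet computes the statistic directly and also as a sum of squared terms [itex](O_i - E_i)/\sqrt{E_i}[/itex], which is the standardization with [itex]\sigma_i^2 = E_i[/itex]. The counts are hypothetical, and the tail probability uses the fact that the usual reference distribution for 3 cells has 2 degrees of freedom, for which the upper tail is exactly [itex]e^{-x/2}[/itex].

```python
import math

def pearson_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def standardized_residuals(observed, expected):
    """Terms Y_i = (O_i - E_i) / sqrt(E_i), i.e. standardizing with sigma_i^2 = E_i."""
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

# Hypothetical counts for 3 cells, each with expected count 20.
observed = [18, 22, 20]
expected = [20.0, 20.0, 20.0]

stat = pearson_statistic(observed, expected)
residuals = standardized_residuals(observed, expected)
sum_sq = sum(y * y for y in residuals)

# Chi-squared with 2 degrees of freedom has upper tail P(S >= x) = exp(-x / 2).
p_value = math.exp(-stat / 2)

print(stat, sum_sq, p_value)
```

Both ways of computing the statistic agree up to rounding, so the formula can indeed be read as a sum of squared residuals scaled by [itex]\sqrt{E_i}[/itex].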

My question is about the standardization of the random variables [itex]X_i[/itex].
If my interpretation above is correct, then Pearson's Chi-squared test somehow assumes that each random variable [itex]X_i[/itex] has variance equal to its expected value, that is: [tex]\sigma_i^2 = \mu_i[/tex]

Why so?
Can anybody explain why we would need to assume that variance and expected values are numerically equal? That condition is satisfied only for some distributions like Poisson and Gamma (with [itex]\theta=1[/itex]). Why such a restriction?

Stephen Tashi · Feb 3, 2014

mnb96 said:

if we assume that each [itex]O_i[/itex] is an observation of the random variable [itex]X_i[/itex]

The [itex]O_i[/itex] are supposed to be a count of how many observations of a random variable fall within a "cell". How are you defining the [itex]i[/itex]th cell?

mnb96 · Feb 4, 2014

Stephen Tashi said:

The [itex]O_i[/itex] are supposed to be a count of how many observations of a random variable fall within a "cell".

I see! That is an important observation. It probably means that the random variables [itex]X_i[/itex] are supposed to follow a multinomial distribution.

For instance, if we have only one cell, then [itex]X_1[/itex] could be the number of successes out of [itex]m[/itex] independent trials of some experiment. Thus, [itex]X_1[/itex] would follow a binomial distribution, which approaches a Poisson distribution when [itex]m[/itex] is large and the success probability is small, and the Poisson distribution has [itex]\sigma^2=\mu=\lambda[/itex].

If the above reasoning is correct, then Pearson's Chi-squared test should work only when the number of trials is sufficiently large.
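A quick numerical check of this reasoning, as a sketch only: for a binomial count [itex]B(m,p)[/itex] the mean is [itex]mp[/itex] and the variance is [itex]mp(1-p)[/itex], so their ratio is [itex]1-p[/itex] whatever the value of [itex]m[/itex]. The helper name `binomial_mean_var` and the chosen values of m and p are illustrative assumptions.

```python
def binomial_mean_var(m, p):
    """Mean and variance of a Binomial(m, p) count."""
    return m * p, m * p * (1.0 - p)

# The ratio variance / mean equals 1 - p and does not depend on m.
for m, p in [(1000, 0.5), (1000, 0.1), (1000, 0.01), (100000, 0.5)]:
    mean, var = binomial_mean_var(m, p)
    print(m, p, mean, var, var / mean)
```

The variance of a single cell count is close to its mean only when [itex]p[/itex] is small, so a large number of trials alone does not give [itex]\sigma^2 = \mu[/itex] for an individual cell.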

Stephen Tashi · Feb 4, 2014

mnb96 said:

It probably means that the random variables [itex]X_i[/itex] are supposed to follow a multinomial distribution.

I'm not sure what you mean by that statement.

The test can be applied to repeated independent samples of a single random variable. The single random variable can have any distribution. It is only necessary to define the cells so that they partition the range of the random variable.
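To make the cell construction concrete, here is a small sketch that draws independent samples of one random variable, partitions its range into cells, and compares observed and expected counts. The exponential distribution, the cut points, and the function names are illustrative assumptions, not something specified in the thread.

```python
import bisect
import math
import random

def cell_counts(samples, cuts):
    """Count samples per cell; cell i is [cuts[i-1], cuts[i]), with open-ended outer cells."""
    counts = [0] * (len(cuts) + 1)
    for z in samples:
        counts[bisect.bisect_right(cuts, z)] += 1
    return counts

def cell_probabilities(cuts, cdf):
    """Probability of each cell under the hypothesized CDF."""
    cum = [0.0] + [cdf(c) for c in cuts] + [1.0]
    return [b - a for a, b in zip(cum, cum[1:])]

def exponential_cdf(x, rate=1.0):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

random.seed(0)
n = 1000
samples = [random.expovariate(1.0) for _ in range(n)]

cuts = [0.5, 1.0, 2.0]  # four cells that partition the real line
observed = cell_counts(samples, cuts)
expected = [n * p for p in cell_probabilities(cuts, exponential_cdf)]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(observed, expected, stat)
```

Only the cell probabilities under the hypothesized distribution enter the expected counts, which is why the underlying random variable itself can have any distribution.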

mnb96 · Feb 4, 2014

Hi Stephen, and thanks for your help!

What I meant is that [itex]X_i[/itex] is a random variable that counts the number of observations that fall into the i-th cell. For instance, if we consider a continuous random variable Z having some unknown probability density function, and we partition the real line into two cells corresponding to the events [itex]Z\geq 10[/itex] (=success) and [itex]Z< 10[/itex] (=failure), then the two events will have probabilities p and (1-p).

We can sample the random variable Z many times, say n times.
Now, [itex]X_1[/itex] is the random variable that counts the total number of successes, thus [itex]X_1[/itex] follows a binomial distribution, i.e. [itex]X_1\sim B(n,p)[/itex].
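For this two-cell case the statistic can be worked out by hand, which may help with the question about the variance. Here n denotes the number of samples (as in this post, not the number of cells), the observed counts are [itex]O_1 = X_1[/itex] and [itex]O_2 = n - X_1[/itex], and the expected counts are [itex]E_1 = np[/itex] and [itex]E_2 = n(1-p)[/itex]. Since [itex]O_2 - E_2 = -(X_1 - np)[/itex], the two terms combine as:
[tex]\frac{(X_1 - np)^2}{np} + \frac{(X_1 - np)^2}{n(1-p)} = \frac{(X_1 - np)^2}{np(1-p)}[/tex]

So, at least for two cells, the statistic is the square of a single binomial count standardized with its actual variance [itex]np(1-p)[/itex]; dividing each term by [itex]E_i[/itex] does not require [itex]\sigma_i^2 = \mu_i[/itex] for the individual counts.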

I thought that if we extend this reasoning to [itex]k[/itex] cells, then the vector of random variables [itex](X_1,\ldots,X_k)[/itex] should follow a multinomial distribution, i.e. [itex](X_1,\ldots,X_k) \sim M(n;p_1,\ldots,p_k)[/itex].
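The multinomial picture can be checked with a small simulation, sketched here under assumed values of n and the cell probabilities (both hypothetical). It samples count vectors, then compares the sample mean and variance of [itex]X_1[/itex] with the binomial values [itex]np_1[/itex] and [itex]np_1(1-p_1)[/itex], and averages the Pearson statistic, whose expectation under the multinomial model is [itex]\sum_i (1-p_i) = k-1[/itex].

```python
import random
import statistics

def multinomial_counts(n, probs, rng):
    """One draw of (X_1, ..., X_k): counts of n independent trials over k cells."""
    counts = [0] * len(probs)
    for cell in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts

def pearson_statistic(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rng = random.Random(0)
n = 200                    # trials per experiment (assumed)
probs = [0.2, 0.3, 0.5]    # cell probabilities (assumed)
expected = [n * p for p in probs]
reps = 2000

first_cell = []
stats = []
for _ in range(reps):
    counts = multinomial_counts(n, probs, rng)
    first_cell.append(counts[0])
    stats.append(pearson_statistic(counts, expected))

mean_x1 = statistics.mean(first_cell)     # compare with n * p_1 = 40
var_x1 = statistics.variance(first_cell)  # compare with n * p_1 * (1 - p_1) = 32
mean_stat = statistics.mean(stats)        # compare with k - 1 = 2

print(mean_x1, var_x1, mean_stat)
```

In this sketch the variance of [itex]X_1[/itex] is below its mean, yet the average of the statistic stays near [itex]k-1[/itex], because the counts are negatively correlated through the constraint [itex]\sum_i X_i = n[/itex].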

Or am I misunderstanding something?
