# Central Limit Theorem applied to a Bernoulli distribution

1. Jun 10, 2014

### TheOldHag

As I understand it, one result of the central limit theorem is that the sampling distribution of means drawn from any population will be approximately normal. Assume the population consist of Bernoulli trials with a given probability p and we want to estimate p. Then our population consist of zeros and ones and our sample means will be in the interval [0,1]. So I'm a bit confused how such a distribution of sample means can be approximated by a normal distribution which is never 0 in the interval (-infinity, infinity).

Is it because the probability is very small outside of [0,1] and thereby still a good approximation. Or am I not understanding the central limit theorem. I have googled a bit and most of what I have found is applying the central limit theorem to the binomial distribution. But this seems to be the easier case and my guess is having to estimate a proportion is very common. Any insight is appreciated.

2. Jun 10, 2014

### jbunniii

It is true that the sample mean $s_n = \sum_{k=1}^{n} x_k$ will be in the interval $[0,1]$. The central limit theorem says that $\sqrt{n}(s_n - \mu) \rightarrow N(0, \sigma)$ (convergence in distribution). Equivalently, the distribution of $s_n$ is approximated by $N(\mu,\sigma/\sqrt{n})$ for large $n$. As $n$ increases, $N(\mu, \sigma/\sqrt{n})$ becomes narrowly centered around $\mu$, and the tails (i.e. the portion of the pdf outside a narrow interval around $\mu$) become very small. This is reasonable because $s_n$ converges with probability 1 to $\mu$ by the law of large numbers, so we would expect that for large $n$, its pdf should be largely confined to a narrow interval around $\mu$.
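A quick way to see this narrowing is to simulate many sample means of Bernoulli($p$) trials and compare their empirical spread with $\sigma/\sqrt{n}$. A minimal sketch (my own, not from the thread; function names are illustrative):

```python
# A sketch (not from the thread): the sample mean of n Bernoulli(p) trials
# concentrates around p, with spread close to sigma/sqrt(n).
import math
import random

def bernoulli_mean(n, p, rng):
    """Average of n Bernoulli(p) trials."""
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n

def empirical_sd_of_means(n, p, reps=2000, seed=0):
    """Standard deviation of `reps` simulated sample means."""
    rng = random.Random(seed)
    means = [bernoulli_mean(n, p, rng) for _ in range(reps)]
    mu = sum(means) / reps
    return math.sqrt(sum((m - mu) ** 2 for m in means) / reps)

p = 0.5
sigma = math.sqrt(p * (1 - p))  # sd of a single Bernoulli(p) trial
for n in (10, 100, 1000):
    print(n, empirical_sd_of_means(n, p), sigma / math.sqrt(n))
```

The printed empirical standard deviations should track $\sigma/\sqrt{n} = 0.5/\sqrt{n}$ quite closely.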

3. Jun 10, 2014

### Stephen Tashi

To add to what Jbunniii said, the technical definition of "will be approximately" involves the concept of convergence of a sequence of functions to a function and the further complication of defining a convergence of a sequence of random variables to another random variable. These concepts are more complicated than the concept of the convergence of a sequence of numbers to another number (i.e. the ordinary idea of limit studied in first semester calculus).

Of course, you can find an article about the convergence of sequences of random variables on Wikipedia. If you want to pursue this, you should begin with an exact statement of what the Central Limit Theorem says.

4. Jun 10, 2014

### jbunniii

One more thing I would like to mention: observe that as long as the $x_n$ are independent and share a common mean $\mu$ and standard deviation $\sigma$, then $E[s_n] = \mu$ and $\text{var}(s_n) = \sigma^2/n$, and so
$$E[\sqrt{n}(s_n - \mu)] = \sqrt{n}(E[s_n] - \mu) = \sqrt{n}(\mu - \mu) = 0$$
and
\begin{align} \text{var}(\sqrt{n}(s_n - \mu)) &= E[n(s_n - \mu)^2] \\ &= n(E[s_n^2] - 2\mu E[s_n] + \mu^2) \\ &= n(E[s_n^2] - \mu^2) \\ &= n(\text{var}(s_n)) \\ &= n(\sigma^2/n) \\ &= \sigma^2\end{align}
So even without taking any limit, we know that $\sqrt{n}(s_n - \mu)$ must have mean $0$ and standard deviation $\sigma$ for every $n$. Thus it's no surprise that the limit distribution (if it exists) must also have this mean and standard deviation. So the real content of the CLT is that if we add the hypothesis that the $x_n$ are identically distributed, then (1) the sequence does converge in distribution and (2) the limit distribution is normal.
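For Bernoulli trials the mean-0/variance-$\sigma^2$ claim can even be checked exactly, since the sum of $n$ of them is binomial. A small sketch (my own, not from the thread), enumerating the binomial pmf:

```python
# A sketch (not from the thread): for s_n the average of n Bernoulli(p)
# trials, verify exactly that sqrt(n)*(s_n - mu) has mean 0 and
# variance sigma^2 = p*(1-p), by enumerating the binomial pmf.
import math

def scaled_moments(n, p):
    """Exact mean and variance of sqrt(n)*(s_n - p)."""
    mean = 0.0
    var = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p**k * (1 - p)**(n - k)
        z = math.sqrt(n) * (k / n - p)  # value of sqrt(n)*(s_n - mu)
        mean += prob * z
        var += prob * z * z
    return mean, var

m, v = scaled_moments(20, 0.3)
print(m, v)  # mean ~ 0, variance ~ 0.3*0.7 = 0.21, for every n
```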

One last thing to note is that convergence in distribution means that the sequence of cdf's (cumulative distribution functions) converges pointwise (at every continuity point of the limit cdf). Thus the fact that a Bernoulli random variable, or any finite sum of them, is discrete, whereas a normal distribution is continuous, does not cause any difficulty. We do NOT claim that the sequence of probability mass functions of the Bernoulli sums somehow converges to
$$\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)$$
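To see the cdf convergence concretely, one can compare the exact cdf of the standardized Bernoulli sum against the standard normal cdf at a fixed point. A sketch (my own, not from the thread; uses only the standard library):

```python
# A sketch (not from the thread): pointwise convergence of the cdf of the
# standardized Bernoulli sum to the standard normal cdf, checked at x = 0.5.
import math

def std_binom_cdf(x, n, p):
    """P(Z <= x) where Z = (S - n*p)/(sqrt(n)*sigma), S ~ Binomial(n, p)."""
    sigma = math.sqrt(p * (1 - p))
    total = 0.0
    for k in range(n + 1):
        if (k - n * p) / (math.sqrt(n) * sigma) <= x:
            total += math.comb(n, k) * p**k * (1 - p)**(n - k)
    return total

def normal_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for n in (10, 100, 1000):
    print(n, std_binom_cdf(0.5, n, 0.5), normal_cdf(0.5))
```

As $n$ grows, the two printed values approach each other, even though one cdf is a step function and the other is smooth.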

5. Jun 10, 2014

### Hornbein

This isn't true for every population: the CLT requires a condition (finite variance), but it is easily met, so don't worry about it. A distribution that does not meet the condition is X/Y, where X and Y are independent standard normal variables; when Y is close to zero you will get freakish results. This (Cauchy) distribution doesn't even have a mean. But this sort of thing never seems to show up in real life.

If you meet that condition, then the distribution of the sample mean converges to a normal distribution. Convergence can be slow, but most people are interested in bounded distributions without extreme skew, so in practice it is usually quite fast.

The normal distribution is an abstract mathematical model; real life NEVER matches it exactly. This is applied mathematics, so when they say "the distribution is normal" they really mean approximately normal. So don't worry about it, unless you have a Bernoulli distribution with p close to 0 or 1. Then you have a skewed distribution and will need a large sample size. If p is close to zero you need a sample size on the order of 15/p; if $p = 10^{-9}$ you have extreme skew and will need a very large sample.
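One way to quantify the skew point: the skewness of the average of $n$ Bernoulli($p$) trials is $(1-2p)/(\sigma\sqrt{n})$, so small $p$ keeps it large unless $n$ grows. A sketch (my own, not from the post; the 0.3 skewness threshold is an arbitrary illustration):

```python
# A sketch (not from the post): skewness of the average of n Bernoulli(p)
# trials is (1 - 2p)/(sigma*sqrt(n)); small p keeps it large unless n grows.
import math

def skew_of_mean(n, p):
    """Skewness of the average of n Bernoulli(p) trials."""
    sigma = math.sqrt(p * (1 - p))
    return (1 - 2 * p) / (sigma * math.sqrt(n))

# Sample size needed to bring the skewness below an (arbitrary) 0.3:
for p in (0.5, 0.1, 1e-3):
    n_needed = math.ceil(((1 - 2 * p) / (0.3 * math.sqrt(p * (1 - p)))) ** 2)
    print(p, n_needed)  # 0 for p = 0.5, since that case has no skew at all
```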

6. Jun 11, 2014

### micromass

The first thing I did was to simulate a basic Bernoulli(1/2) variable. So I generated 10000 data points distributed Bernoulli(1/2). This is the plot of the frequencies of the data points:

As we expect, about half of the data points are at 0, and half are at 1.

Then I took 2 data points that were Bernoulli(1/2) distributed and took the average of the two. I did this 10000 times. This is the plot of the frequencies:

I then took 5 data points that were Bernoulli(1/2) distributed and took the average of the 5. I did this 10000 times. This is the plot of the frequencies:

I then took 10 data points that were Bernoulli(1/2) distributed and took the average of the 10. I did this 10000 times. This is the plot of the frequencies:

I then took 100 data points that were Bernoulli(1/2) distributed and took their average. I did this 10000 times. This is the plot of the frequencies:

I then took 1000 data points that were Bernoulli(1/2) distributed and took their average. I did this 10000 times. This is the plot of the frequencies:

As you see, the mean of the distribution shifts to the right as I take more data points. You also see that the shape gets very close to a normal distribution; even with only 10 data points, it is already quite close! The variance also gets smaller each time (relative to the total range).

So while you are right that I can never generate negative values, the distribution shifts to the right and its variance gets (relatively) small, so the theoretical contribution from the negative values of the approximating normal distribution gets closer and closer to 0.
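The simulation described above can be sketched in a few lines (my own reconstruction, not micromass's actual code; this version computes the averages directly):

```python
# A sketch (not micromass's actual code): averages of n Bernoulli(1/2)
# trials, repeated 10000 times, as in the plots described above.
import random

def simulate_means(n, reps=10000, p=0.5, seed=1):
    """List of `reps` averages of n Bernoulli(p) trials."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        s = sum(1 if rng.random() < p else 0 for _ in range(n))
        means.append(s / n)
    return means

for n in (1, 2, 5, 10, 100):
    means = simulate_means(n)
    avg = sum(means) / len(means)
    print(n, round(avg, 3))  # the mean stays near 0.5 while the spread shrinks
```

Plotting a histogram of `means` for each `n` reproduces the progression toward a bell shape described in the post.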

#### Attached Files:

• ###### Data1000.png (one of six attached frequency plots)
7. Jun 11, 2014

### jbunniii

@micro - Based on the x-axis in your plots, I think your simulation is summing the Bernoulli trials, not averaging them, correct?

I also noticed that I omitted a $1/n$ scale factor in my definition of $s_n$ in post #2. It should have been
$$s_n = \frac{1}{n}\sum_{k=1}^{n} x_k$$
The rest of my remarks assumed we were talking about averages, not sums.

If we sum instead of averaging, we get similar results, except the mean and standard deviation of $\sum_{k=1}^n x_n$ are $n \mu$ and $\sqrt{n}\sigma$ instead of $\mu$ and $\sigma/\sqrt{n}$. However, the sums do NOT converge in distribution.
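The two scalings can be written down directly; a tiny sketch (my own, not from the thread):

```python
# A sketch (not from the thread): sums and averages of n Bernoulli(p)
# trials scale differently -- sd sqrt(n)*sigma versus sigma/sqrt(n).
import math

def sum_and_mean_sd(n, p):
    """(sd of the sum, sd of the average) of n Bernoulli(p) trials."""
    sigma = math.sqrt(p * (1 - p))
    return math.sqrt(n) * sigma, sigma / math.sqrt(n)

for n in (10, 100, 1000):
    sd_sum, sd_mean = sum_and_mean_sd(n, 0.5)
    print(n, sd_sum, sd_mean)  # sd of the sum grows; sd of the average shrinks
```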

8. Jun 11, 2014

### micromass

Yes, that is correct. Here are the plots for the averages:

The means of the above plots now all stay the same (at 1/2), but the variances decrease each time. So the corresponding normal approximations have decreasing variances, and the probability they assign to negative values decreases to 0.
