nraic said:
Hi,
I am doing an undergraduate introductory statistics course and I'm trying to understand some basic concepts.
I'm trying to understand why the sample size (n) affects the standard deviation of the sampling distribution of the mean (σ_{M})
I understand that the sample size affects the sampling distribution of the mean. I've been shown that with larger sample sizes the standard deviation decreases, and this can be seen graphically: the normal curve of the sample means becomes narrower as the sample size increases.
σ_{M} = σ/√n
What I don't understand is why this is happening.
I have this intuitive feeling that if you take an infinite number of sample means, they should have a fixed mean and standard deviation, and that this shouldn't differ whether you take samples of n=10 or n=100. I've been shown that this is wrong, but I don't understand why.
Hey nraic and welcome to the forums.
Let's assume we have a consistent, unbiased estimator. If you don't know what these terms mean, I suggest you look them up on Wikipedia or in a textbook. These are the estimators used in practice because they are actually useful for statistics.
The simple idea behind why the variance of the estimator shrinks is that more data gives us a better estimate: with more data, the range in which most of the probable values of the estimator lie gets smaller, and this is reflected in the variance shrinking as a result of dividing by the square root of n.
The idea is based on the law of large numbers.
Intuitively, it says that the more data we collect and then average, the closer that average gets to the true mean of the distribution, and this holds not just for one specific distribution but for any distribution. Basically it's a convergence result: as you collect more and more data and average all the numbers, you get a better and better approximation of the mean.
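If it helps to see this numerically, here is a minimal sketch in Python. The choice of an exponential distribution with true mean 2.0, and the seed, are just assumptions for illustration; the point is simply that the sample mean drifts towards the true mean as n grows.

```python
import numpy as np

# Minimal sketch of the law of large numbers.
# The exponential distribution and its mean of 2.0 are illustrative assumptions.
rng = np.random.default_rng(0)
true_mean = 2.0

for n in [10, 100, 10_000, 1_000_000]:
    data = rng.exponential(scale=true_mean, size=n)
    # The average of n draws gets closer to the true mean as n increases.
    print(f"n = {n:>9,}: sample mean = {data.mean():.4f}   (true mean = {true_mean})")
```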
Now think about this in terms of the variance: if the estimator is consistent and unbiased, then as we get more information the variance shrinks, which means we gain more certainty about our estimate of the mean.
Since variance is one measure of uncertainty, it is no surprise intuitively that the variance gets lower as we get more information in the form of more data points.
If the variance went up, that would mean our guess for the mean becomes more uncertain as we collect data, which would imply we would be better off getting less information! That doesn't make sense intuitively.
Likewise, if the variance stayed the same, then no amount of extra information would give us a more accurate guess for the mean at a fixed level of confidence, which would make it pointless to collect more data.
So think about it this way: getting more data should tell us more about what we are trying to measure and reduce our uncertainty about it, and this reduced uncertainty shows up as a lower variance.
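You can also check the σ_{M} = σ/√n formula from your question directly by simulation. The sketch below assumes a normal population with σ = 5 purely for illustration: it draws many samples of size 10 and of size 100, computes the mean of each sample, and compares the spread of those sample means to σ/√n.

```python
import numpy as np

# Sketch: the standard deviation of the sample means matches sigma/sqrt(n).
# The normal population, sigma = 5, and the sample sizes are illustrative assumptions.
rng = np.random.default_rng(0)
sigma = 5.0            # assumed population standard deviation
num_samples = 100_000  # how many sample means we simulate for each n

for n in [10, 100]:
    # Draw num_samples samples of size n each, then take the mean of each sample.
    sample_means = rng.normal(loc=0.0, scale=sigma, size=(num_samples, n)).mean(axis=1)
    print(f"n = {n:>3}: sd of the sample means = {sample_means.std():.4f}, "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```

Going from n = 10 to n = 100 cuts the spread of the sample means by a factor of √10, exactly as the formula predicts.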
I want to say something though that I think you should hear because you may get the wrong idea if I don't say it.
The thing about the estimator is that no matter how many data points we get from an unknown process (the thing generating the data), you will never be able to say with 100% certainty what the mean is, whether as a single point or even as an interval. Your 95% interval will shrink a lot once you have collected 100,000 data points, but the 100% interval will always be the entire real line.
To understand this, imagine a process whose real mean is 0. The first hundred thousand data points could all be positive and give an estimate that is also positive. This could happen for a million, a billion, even a googolplex data points!
But after that you might get just as many negative values, and once these are taken into account your estimate shifts from positive back towards zero.
This is why for a truly unknown process you will never be able to have a fixed interval for 100% confidence.
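Here is a small sketch of the first part of that point, namely that the (approximate) 95% interval for the mean keeps narrowing as you collect more data but never collapses to a single point. The normal population with true mean 0 is an illustrative assumption, and the 1.96 multiplier is the usual normal approximation.

```python
import numpy as np

# Sketch: an approximate 95% interval for the mean narrows as n grows,
# but it is never a single point. Population choice is an illustrative assumption.
rng = np.random.default_rng(0)

for n in [100, 10_000, 1_000_000]:
    data = rng.normal(loc=0.0, scale=1.0, size=n)       # true mean is 0 (assumed)
    m = data.mean()
    half_width = 1.96 * data.std(ddof=1) / np.sqrt(n)   # normal-approximation 95% interval
    print(f"n = {n:>9,}: 95% interval ≈ [{m - half_width:+.4f}, {m + half_width:+.4f}]")
```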
So in conclusion: more information about a process (a larger sample) should, by most measures, give us more certainty (though never 100% certainty, of course) about where we would most likely expect the mean to lie, because the variance shrinks.