Standard Deviation as Function of Sample Size

In summary: the conversation discusses the relationship between sample size and standard deviation. While in high school it is often taught that the standard deviation decreases as sample size increases, the thread clarifies that it is the standard error of the mean that decreases; the population standard deviation remains constant, and the sample standard deviation simply becomes a more precise estimate of it. The ISOBudgets website provides a function to determine the minimum sample size needed to achieve a desired margin of error. The conversation also touches on estimators as random variables and the distinction between a population parameter and an estimate of it.
  • #1
Roger Dodger
In high school, I was taught that the standard deviation drops as you increase the sample size. For this reason, larger sample sizes produce less fluctuation. At the time, I didn't question this because it made sense.

Then, I was taught that the standard deviation does not drop as you increase sample size. Rather, it was the standard error that dropped. Sounded fine to me. And most of the resources I find agree -- the standard deviation might fluctuate slightly, but it does not drop with increasing sample size.

But today I came across this page on the ISOBudgets website: http://www.isobudgets.com/introduction-statistics-uncertainty-analysis/#sample-size. Here, it states: "Have you ever wanted to reduce the magnitude of your standard deviation? Well, if you know how small you want the standard deviation to be, you can use this function to tell you how many samples you will need to collect to achieve your goal." It goes on to provide such a function for finding this minimum n, namely √n = (desired confidence level) × (current standard deviation) / (margin of error).
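In code, the quoted rule would look something like this (a sketch only; the function name, the z-value standing in for the "desired confidence level", and the example numbers are my own illustrative choices, not from the ISOBudgets page). Note that what this actually controls is the standard error of the mean, not the standard deviation itself:

```python
import math

def required_sample_size(z: float, sigma: float, margin_of_error: float) -> int:
    """Hypothetical helper for the quoted relation sqrt(n) = z * sigma / E,
    i.e. n = (z * sigma / E)^2."""
    n = (z * sigma / margin_of_error) ** 2
    return math.ceil(n)  # round up so the margin of error is actually met

# Example with made-up numbers: 95% confidence (z = 1.96),
# current standard deviation 2.0, desired margin of error 0.5.
print(required_sample_size(1.96, 2.0, 0.5))  # -> 62
```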

Am I misreading this? I used a random number generator to produce about 500 random numbers from a normal distribution, and the standard deviation does not drop. What am I missing?
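Here is roughly what my experiment looks like as a Python sketch (standard library only; a standard normal stands in for my generator). The sample standard deviation holds steady while a second quantity, the standard error of the mean, shrinks:

```python
import random
import statistics

random.seed(0)

# Draw from a standard normal and track two quantities as n grows:
# the sample standard deviation (which stabilizes near sigma = 1) and
# the standard error of the mean (which shrinks like sigma / sqrt(n)).
for n in (10, 100, 500, 5000):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    s = statistics.stdev(xs)   # sample SD, divides by n - 1
    se = s / n ** 0.5          # standard error of the mean
    print(f"n={n:5d}  sample SD={s:.3f}  standard error={se:.4f}")
```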
 
  • #2
This is one of those things that probably needs to be written out mathematically --

standard deviation of what exactly?

I'd also suggest for now: focus on variance, not standard deviation, as variance plays better with linearity. (You can always take a square root at the end in the privacy of your own home.)

For example you could be talking about the variance of a sum of random variables. Or the variance of a rescaled (say divide by n) sum of random variables. Or not the variance in the underlying random variables but in your estimate of some attribute of them. Or ...
 
  • #3
CORRECTION: This post is wrong. Please ignore it.

The population standard deviation, ##\sigma##, of the probability distribution does not change. However, the sample standard deviation that estimates ##\sigma## using the formula
$$ S = \sqrt{ \frac {{\sum_{i=1}^N (x_i - \bar x)^2}}{N-1}} $$
does decrease as N increases. It becomes a better estimator of ##\sigma##.
 
  • #4
FactChecker said:
The population standard deviation, ##\sigma##, of the probability distribution does not change. However, the sample standard deviation that estimates ##\sigma## using the formula
$$ S = \sqrt{ \frac {{\sum_{i=1}^N (x_i - \bar x)^2}}{N-1}} $$
does decrease as N increases. It becomes a better estimator of ##\sigma##.

But isn't the only difference between the two the fact that you divide by √N in one case and by √(N-1) in the other? That means for large N they pretty much behave the same. Or am I misunderstanding your point?
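A quick numerical check of that point (Python standard library; the data are just illustrative normal draws):

```python
import random
import statistics

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(500)]

# statistics.pstdev divides by N; statistics.stdev divides by N - 1.
print(statistics.pstdev(xs))  # population-style formula
print(statistics.stdev(xs))   # sample formula
# For N = 500 the two differ by a factor of sqrt(500/499), about 1.001,
# so for large N they behave essentially identically.
```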
 
  • #5
StoneTemplePython said:
This is one of those things that probably needs to be written out mathematically --

standard deviation of what exactly?

I'd also suggest for now: focus on variance, not standard deviation, as variance plays better with linearity. (You can always take a square root at the end in the privacy of your own home.)

For example you could be talking about the variance of a sum of random variables. Or the variance of a rescaled (say divide by n) sum of random variables. Or not the variance in the underlying random variables but in your estimate of some attribute of them. Or ...

As for your first question, I am referring to the standard deviation of N measurements of (say) the mass of an object, with an underlying normal probability distribution understood.

As for the variance, why would it behave any differently than the standard deviation? I mean, if the variance stays constant for increasing N, wouldn't the standard deviation as well?
 
  • #6
I'm sorry, I got sloppy. My post was wrong. I would like to delete it.
 
  • #7
FactChecker said:
I'm sorry, I got sloppy. My post was wrong. I would like to delete it.

That's cool. You should be able to, or at least edit out the text.
 
  • #8
Roger Dodger said:
As for your first question, I am referring to the standard deviation of N measurements of (say) the mass of an object, with an underlying normal probability distribution understood.
Can you write it out mathematically? I think I know what you mean but it is not what you just said here.
Roger Dodger said:
As for the variance, why would it behave any differently than the standard deviation? I mean, if the variance stays constant for increasing N, wouldn't the standard deviation as well?

In this realm, people trip up on Jensen's Inequality and the Triangle Inequality a lot. Trying to preserve linearity is worth it. (Then just take the square root at the end.)

caveat: I may decide your question is different than what I thought and decide std deviation is nice to work with directly -- rare but it happens.
 
  • #9
StoneTemplePython said:
Can you write it out mathematically? I think I know what you mean but it is not what you just said here.

In this realm, people trip up on Jensen's Inequality and the Triangle Inequality a lot. Trying to preserve linearity is worth it. (Then just take the square root at the end.)

Sure, here is the standard deviation:

$$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$

where ##N## is the sample size, ##x_i## is an individual measurement, and ##\langle x\rangle## is the mean of all measurements.
 
  • #10
Roger Dodger said:
Sure, here is the standard deviation:

$$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$

where ##N## is the sample size, ##x_i## is an individual measurement, and ##\langle x\rangle## is the mean of all measurements.

As for Jensen's inequality and such, I'm not saying that the variance and standard deviation would rise by the same amount as N increases, only that if one increases the other must increase as well, and that if one stays constant, so does the other.
 
  • #11
Roger Dodger said:
Sure, here is the standard deviation:

$$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$

where ##N## is the sample size, ##x_i## is an individual measurement, and ##\langle x\rangle## is the mean of all measurements.

This is getting closer but still doesn't mathematically state your problem. So you have a random variable ##X## with a variance called ##\sigma_X^2##. I think you are talking about sampling -- specifically ##n## iid trials, and you want to estimate ##E\big[X\big]## and ##\sigma_X^2 = E\big[X^2\big] - E\big[X\big]^2##. Is that your goal?

The idea here is you need to estimate the mean ##E\big[X\big]## and the second moment ##E\big[X^2\big]## or ##E\big[Y\big]## where ##Y = X^2## if you like. I assume both of these exist.

As you get larger and larger samples, your estimates will concentrate about the mean. Pick a favorite limit law. As estimates concentrate about the mean, the 'distance' between the estimates and the correct value goes down. Hence the variance of your estimate (read: squared 2 norm of difference between estimates and true value) goes down.

I wouldn't worry about the divide by ##n## vs divide by ##n-1## issue here.
- - - -
It could be instructive to work through a fully fleshed-out problem, both the math and a simulation, involving coin tossing and estimating the mean and variance. Since coin tosses have bounded (specifically 0 or 1 only in this case) results, you can get very sharp estimates on concentration about the mean via Chernoff bounds.
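Something like the following Python sketch, for instance (a fair coin by default; the toss and trial counts are arbitrary choices of mine):

```python
import random
import statistics

random.seed(2)

def estimate_coin(n_tosses: int, p: float = 0.5) -> tuple[float, float]:
    """Estimate the mean and variance of a Bernoulli(p) coin from n_tosses tosses."""
    tosses = [1 if random.random() < p else 0 for _ in range(n_tosses)]
    m = statistics.fmean(tosses)
    return m, m * (1 - m)  # plug-in variance: E[X^2] - E[X]^2 = m - m^2 for 0/1 data

# The spread of the mean estimates across many trials shrinks as n grows,
# which is the concentration the limit laws (and Chernoff bounds) promise.
for n in (10, 100, 1000):
    means = [estimate_coin(n)[0] for _ in range(10_000)]
    print(f"n={n:5d}  spread of mean estimates = {statistics.stdev(means):.4f}")
```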
 
  • #12
StoneTemplePython said:
This is getting closer but still doesn't mathematically state your problem. So you have a random variable ##X## with a variance called ##\sigma_X^2##. I think you are talking about sampling -- specifically ##n## iid trials, and you want to estimate ##E\big[X\big]## and ##\sigma_X^2 = E\big[X^2\big] - E\big[X\big]^2##. Is that your goal?

The idea here is you need to estimate the mean ##E\big[X\big]## and the second moment ##E\big[X^2\big]## or ##E\big[Y\big]## where ##Y = X^2## if you like. I assume both of these exist.

As you get larger and larger samples, your estimates will concentrate about the mean. Pick a favorite limit law. As estimates concentrate about the mean, the 'distance' between the estimates and the correct value goes down. Hence the variance of your estimate (read: squared 2 norm of difference between estimates and true value) goes down.

I wouldn't worry about the divide by ##n## vs divide by ##n-1## issue here.
- - - -
It could be instructive to work through a fully fleshed-out problem, both the math and a simulation, involving coin tossing and estimating the mean and variance. Since coin tosses have bounded (specifically 0 or 1 only in this case) results, you can get very sharp estimates on concentration about the mean via Chernoff bounds.

I used the NORMINV() function in Excel to generate random numbers that are distributed normally about a mean. The distribution is continuous, unlike coin flips. As more and more numbers are generated, the standard deviation doesn't drop. Most say that as I increase the sample size, the estimates will concentrate closer to the mean. But it doesn't appear that is happening.
 
  • #13
Roger Dodger said:
I used the NORMINV() function in Excel to generate random numbers that are distributed normally about a mean. The distribution is continuous, unlike coin flips. As more and more numbers are generated, the standard deviation doesn't drop. Most say that as I increase the sample size, the estimates will concentrate closer to the mean. But it doesn't appear that is happening.

I think you're making a mistake (a) in using Excel (hard for others to replicate, and not known for being good in simulations -- a free Excel add-on called PopTools is decent though) and (b) in starting with a normal distribution instead of something simpler, in particular coin tossing (and yes, that can be normal-approximated, but that's a different topic).

Assuming the moments exist, these limit laws are ironclad, which tells me you're doing something wrong here, but I can't read minds. The point is that the variance/variation in your estimates comes down for bigger sample sizes. You may consider trying a simulation with ##n=10## data points instead of ##n=500## and comparing the mean and variance estimates, as well as the variance in said estimates, after running many trials. My gut tells me that you're not calculating things in the way I would, but this is the difference between pasting in a few lines of code for others to look at and working in Excel.

- - - -
It also occurs to me that the normal distribution may be too 'nice' and hence you can miss subtle differences. Again with coin tossing in mind, consider a very biased coin that has a value of 1 (heads) with probability ##p = 10^{-2}## and a value of 0 (tails) with probability ##1 - p##. Consider the variation in your estimates of the mean and variance of said coin when you run 10,000 trials with each trial having, say, ##100## tosses, versus running 10,000 trials with each trial having ##10,000## tosses in it.
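As a sketch of that experiment (scaled down to 1,000 trials per setting so pure Python stays quick; the seed is arbitrary):

```python
import random
import statistics

random.seed(3)
p = 1e-2  # heavily biased coin: 1 (heads) with probability 0.01

def estimate_mean(n_tosses: int) -> float:
    """One trial: toss the coin n_tosses times and estimate its mean."""
    heads = sum(1 for _ in range(n_tosses) if random.random() < p)
    return heads / n_tosses

for n in (100, 10_000):
    estimates = [estimate_mean(n) for _ in range(1_000)]
    print(f"{n:6d} tosses/trial: mean of estimates = "
          f"{statistics.fmean(estimates):.4f}, "
          f"spread of estimates = {statistics.stdev(estimates):.5f}")
# With 100 tosses per trial, many trials see no heads at all, so the
# estimates are crude; with 10,000 tosses they cluster tightly near 0.01.
```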
 
  • #14
Thanks, I'll look into it.
 
  • #15
Roger Dodger said:
As for your first question, I am referring to the standard deviation of N measurements of (say) the mass of an object, with an underlying normal probability distribution understood.
You're failing to cope with the complicated vocabulary of statistics.

Roger Dodger said:
Sure, here is the standard deviation:

$$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$

That is an estimator for the standard deviation of the population from which the N measurements are taken. Some also call it the "sample standard deviation". Others reserve the term "sample standard deviation" for the similar expression with N instead of N-1 in the denominator.

The mean of a sample of N measurements is a random variable. It has a standard deviation (namely, the standard deviation of its probability distribution). That standard deviation is not a function of the values ##X_i## obtained in one particular sample.

Roger Dodger said:
Then, I was taught that the standard deviation does not drop as you increase sample size.

Which standard deviation are you talking about?

If ##X## is a random variable with standard deviation ##\sigma_X## then taking 100 samples of ##X## does not change the standard deviation of ##X##, but the random variable ##Y## defined by the mean of 100 samples of ##X## has a smaller standard deviation than ##X##. Neither of these standard deviations is the same as an estimator of a standard deviation. An estimator of a standard deviation is itself a random variable. It isn't a constant parameter associated with a probability distribution.
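A quick simulation makes the contrast concrete (my arbitrary choices: ##\sigma_X = 2##, samples of 100, and 10,000 replications). The simulated standard deviation of ##Y## comes out near ##\sigma_X/\sqrt{100} = 0.2##:

```python
import random
import statistics

random.seed(4)
sigma_x = 2.0

# Y = mean of 100 samples of X, where X ~ Normal(0, sigma_x).
# Theory says sigma_Y = sigma_x / sqrt(100) = 0.2.
ys = [statistics.fmean(random.gauss(0.0, sigma_x) for _ in range(100))
      for _ in range(10_000)]
print(statistics.stdev(ys))  # close to 0.2
```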
 
  • #16
Interesting. I appreciate the feedback. So let me ask, how would YOU define the standard deviation?
 
  • #17
Roger Dodger said:
Interesting. I appreciate the feedback. So let me ask, how would YOU define the standard deviation?

That's like asking "How do you find the place?" It isn't a specific question.

The use of "the" in the phase "the standard deviation" suggests that it can only refer to a single thing. That is not the case. The phrase "the standard deviation" is unspecific. As @StoneTemplePython said in post #2, standard deviation of what exactly?

Terms like "standad deviation", "average", "mean" etc. have at least 3 possible meanings

1. They may refer to a specific number obtained in a specific sample, e.g. "The mean weight in a sample of 5 apples was 0.2 kg."
2. They may refer to a specific number that gives the value of a parameter associated with a probability distribution, e.g. "We assume the population of apples has a normal distribution with mean 0.2."
3. They may refer to a random variable, e.g. "The distribution of the mean of a sample of 5 apples taken from a population of apples with mean 0.2 also has a mean of 0.2." (So we can speak of "the mean of a mean", "the standard deviation of a mean", "the standard deviation of a standard deviation", etc.)
 
  • #18
When you wrote "That is an estimator for the standard deviation of the population from which the N measurements are taken," what definition were you using?
 
  • #19
Roger Dodger said:
When you wrote "That is an estimator for the standard deviation of the population from which the N measurements are taken," what definition were you using?

An estimator is a function of the values obtained in a sample. For example,
$$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$
is an estimator because it depends on the values ##x_i## obtained in a sample.

Since the practical use of estimators is to estimate the parameters of probability distributions, we can speak of an "estimator of the standard deviation", an "estimator of the variance", etc.

As you know, the mean of a sample of N things chosen from a population need not be exactly equal to the mean of the population. The mean of the sample is used as an estimator of the mean of the population. Likewise, we can define estimators for the standard deviation and variance.

A complicated question in statistics is deciding which formula is the "best" estimator for a population parameter. More vocabulary is needed in order to be specific about what is meant by "best". There are "unbiased estimators", "minimum variance estimators", "consistent estimators", and "maximum likelihood estimators".

Furthermore, an estimator is itself a random variable because it depends on the random values that occur in a sample. So an estimator has a mean, variance, standard deviation, etc., just like other random variables.
 
  • #20
Okay, let's go back to the equation:

$$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$

Let's not even use the term "standard deviation." Here, ##N## is the number of samples selected from a population. The average in the equation refers to the sample mean. The ##x_i##'s are generated from a process modeled by a normal distribution.

What likely happens to ##\sigma## in the above equation as we increase ##N##, that is, as we select more and more samples from the population? Yes, the sample mean changes in value. Understood. If it cannot be said one way or the other (that is, it could go up or down), I'm good with that.
 
  • #21
Roger Dodger said:
What likely happens to ##\sigma## in the above equation as we increase ##N##, that is, as we select more and more samples from the population?

(It's traditional to put a "hat" on random variables representing estimators. So ##\hat{\sigma}## would be a better notation. However, let's use your notation.)

The graph of the probability density of the estimator ##\sigma## (as you define ##\sigma##) has a peak near the value equal to the standard deviation ##\sigma_p## of the normal distribution from which the ##x_i## are chosen. As ##N## gets larger, this peak gets taller and narrower. Hence, as ##N## gets larger, it is more probable that ##\sigma## will have a value near ##\sigma_p##.

We can contrast this with the estimator ##\mu = \frac{ \sum_{i=1}^N x_i}{N}##. The probability density of ##\mu## has a peak at ##\mu_p##, the mean of the normal distribution from which the ##x_i## are chosen. As ##N## increases, this peak becomes taller and narrower. The narrowness of the peak is indicated by the standard deviation of ##\mu##, which is ##\sigma_{\mu} = \frac{ \sigma_p}{\sqrt{N}}##. The standard deviation of the distribution of ##\mu## gets smaller as ##N## becomes larger. As ##N## becomes larger, it becomes more probable that the value of ##\mu## will be close to ##\mu_p##.
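A short simulation illustrates this sampling distribution (assuming a standard normal, so ##\sigma_p = 1##; 5,000 replications per ##N## is an arbitrary choice). The estimates cluster more tightly around 1 as ##N## grows:

```python
import random
import statistics

random.seed(5)

# Sampling distribution of the estimator sigma-hat: draw many samples of
# size N from Normal(0, 1), compute the sample SD of each, and see the
# estimates cluster more tightly around sigma_p = 1 as N grows.
for n in (5, 50, 500):
    sds = [statistics.stdev(random.gauss(0.0, 1.0) for _ in range(n))
           for _ in range(5_000)]
    print(f"N={n:4d}  mean of estimates = {statistics.fmean(sds):.3f}  "
          f"spread of estimates = {statistics.stdev(sds):.4f}")
```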
 
  • #22
Stephen Tashi said:
( It's traditional to put a "hat" on random variables representing estimators. So ##\hat{\sigma}## would be a better notation. However, let's use your notation.)

The graph of the probability density of the estimator ##\sigma## (as you define ##\sigma##) has a peak near the value equal to the standard deviation ##\sigma_p## of the normal distribution from which the ##x_i## are chosen. As ##N## gets larger, this peak gets taller and narrower. Hence, as ##N## gets larger, it is more probable that ##\sigma## will have a value near ##\sigma_p##.

We can contrast this with the estimator ##\mu = \frac{ \sum_{i=1}^N x_i}{N}##. The probability density of ##\mu## has a peak at ##\mu_p##, the mean of the normal distribution from which the ##x_i## are chosen. As ##N## increases, this peak becomes taller and narrower. The narrowness of the peak is indicated by the standard deviation of ##\mu##, which is ##\sigma_{\mu} = \frac{ \sigma_p}{\sqrt{N}}##. The standard deviation of the distribution of ##\mu## gets smaller as ##N## becomes larger. As ##N## becomes larger, it becomes more probable that the value of ##\mu## will be close to ##\mu_p##.

Okay, so you state that "As ##N## gets larger, this peak gets taller and narrower." That makes sense and is what I always thought. Would this not mean that ##\sigma## drops as N increases? I would normally think that a tall, narrow probability distribution would correspond to a small ##\sigma##.
 
  • #23
Roger Dodger said:
Would this not mean that ##\sigma## drops as N increases?

No. ##\sigma## is not a single number. ##\sigma## is a random variable. The graph of the probability distribution of ##\sigma## has a peak in probability (the ##y## value) near the ##x## value ##\sigma_p##.

You can see graphs of probability densities for ##s = \sqrt{ \frac{ \sum_{i=1}^N (x_i - \langle x\rangle)^2}{N}}## at http://mathworld.wolfram.com/StandardDeviationDistribution.html. That page makes the tacit assumption that ##\sigma_p = 1##. Those graphs are similar to the probability densities for ##\sigma##.
 
  • #24
So is the mean a random variable and not a number as well?
 
  • #25
By "the mean", are you referring to the estimator of the population mean defined in post #21 Yes, that estimator of the mean, (which is called "the sample mean") is a random varible. It varies from sample to sample.

As a random variable, the sample mean has a distribution. That distribution has its own mean, so there is a "mean of the sample mean". The mean of the distribution of the sample mean is a single number. It turns out that the mean of the distribution of the sample mean is equal to the mean of the population from which we are taking independent samples. For that reason, the sample mean is called an "unbiased" estimator of the population mean.
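A quick check of that unbiasedness claim (with made-up parameters: population mean 10, standard deviation 3, samples of size 20, 20,000 replications):

```python
import random
import statistics

random.seed(6)

# Unbiasedness: the sample mean varies from sample to sample, but averaging
# the sample means over many samples recovers the population mean.
population_mean, population_sd, sample_size = 10.0, 3.0, 20
sample_means = [
    statistics.fmean(random.gauss(population_mean, population_sd)
                     for _ in range(sample_size))
    for _ in range(20_000)
]
print(statistics.fmean(sample_means))  # close to 10.0
```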
 

1. What is Standard Deviation as a Function of Sample Size?

Standard deviation as a function of sample size describes how the spread statistics of a data set behave as more data are collected. The standard deviation measures the amount of variation or spread in a set of data and is calculated by taking the square root of the variance. As the sample size increases, the sample standard deviation does not shrink toward zero; it settles near the population standard deviation, while the standard error of the mean decreases, indicating a more precise estimate of the data's mean.

2. Why is Standard Deviation as a Function of Sample Size important?

Standard deviation as a function of sample size is important because it helps to determine how reliable or representative the sample is of the entire population. A larger sample size yields a smaller standard error and therefore more precise estimates of the population parameters, while a smaller sample size may not accurately reflect the true population.

3. How is Standard Deviation as a Function of Sample Size calculated?

To calculate standard deviation as a function of sample size, the sample size, mean, and individual data points are needed. The formula is: standard deviation = √(∑(x - x̄)^2 / (n-1)), where x represents the individual data points, x̄ represents the mean, and n represents the sample size.
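For illustration, a direct translation of this formula into Python (the function name and example data are hypothetical):

```python
import math

def sample_standard_deviation(data: list[float]) -> float:
    """Direct translation of the formula: sqrt(sum((x - xbar)^2) / (n - 1))."""
    n = len(data)
    xbar = sum(data) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

# Illustrative data only.
print(sample_standard_deviation([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # ~2.14
```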

4. What is the relationship between Sample Size and Standard Deviation?

The population standard deviation does not change with sample size. What decreases as the sample size increases is the standard error of the mean, σ/√n. A larger sample size therefore provides a more accurate and precise representation of the population, even though the sample standard deviation itself settles near the population value rather than shrinking toward zero.

5. How does Standard Deviation as a Function of Sample Size impact statistical analysis?

Standard deviation as a function of sample size impacts statistical analysis by providing a measure of the reliability and precision of the data. It helps to identify any outliers or extreme values in the data set, which can affect the overall results and conclusions drawn from the data. Additionally, it allows for the comparison of different data sets with varying sample sizes, as the standard deviation can be used to determine which data set is more representative of the population.
