B Standard Deviation as a Function of Sample Size

  1. Aug 8, 2018 #1
    In high school, I was taught that the standard deviation drops as you increase the sample size. For this reason, larger sample sizes produce less fluctuation. At the time, I didn't question this because it made sense.

    Then, I was taught that the standard deviation does not drop as you increase sample size. Rather, it was the standard error that dropped. Sounded fine to me. And most of the resources I find agree -- the standard deviation might fluctuate slightly, but it does not drop with increasing sample size.

    But today I came across this ISOBudgets page: http://www.isobudgets.com/introduction-statistics-uncertainty-analysis/#sample-size. It states, "Have you ever wanted to reduce the magnitude of your standard deviation? Well, if you know how small you want the standard deviation to be, you can use this function to tell you how many samples you will need to collect to achieve your goal." It goes on to provide such a function for finding this minimum n, namely ##\sqrt{n} = (\text{desired confidence level}) \times (\text{current standard deviation}) \,/\, (\text{margin of error})##.
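    If I'm reading that right, the "desired confidence level" factor is the z-value for the chosen confidence level, so the rule amounts to ##n = (z \cdot s / E)^2##. As a made-up example (the numbers are only for illustration): with a current standard deviation of ##s = 2.0##, a desired margin of error ##E = 0.5##, and 95% confidence (##z \approx 1.96##),

    $$n = \left( \frac{1.96 \times 2.0}{0.5} \right)^2 \approx 61.5,$$

    so about 62 samples.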

    Am I misreading this? I used a random number generator to produce about 500 random numbers from a normal distribution, and the standard deviation does not drop as more numbers are added. What am I missing?
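    For reference, here is roughly the experiment, redone as a short Python sketch instead of Excel (the mean and standard deviation are just placeholder values). In this sketch the running sample standard deviation hovers near ##\sigma##, while it is the standard error of the mean that shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0                    # placeholder population parameters

for n in (10, 50, 100, 500, 5000):
    x = rng.normal(mu, sigma, size=n)    # n draws from N(mu, sigma^2)
    s = x.std(ddof=1)                    # sample standard deviation (divides by n-1)
    se = s / np.sqrt(n)                  # standard error of the sample mean
    print(f"n={n:5d}  sample SD = {s:.3f}  standard error = {se:.3f}")
```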
     
  3. Aug 8, 2018 #2

    StoneTemplePython

    Science Advisor
    Gold Member

    This is one of those things that probably needs to be written out mathematically --

    standard deviation of what exactly?

    I'd also suggest for now: focus on variance, not standard deviation, as variance plays better with linearity. (You can always take a square root at the end in the privacy of your own home.)

    For example you could be talking about the variance of a sum of random variables. Or the variance of a rescaled (say divide by n) sum of random variables. Or not the variance in the underlying random variables but in your estimate of some attribute of them. Or ...
     
  4. Aug 8, 2018 #3

    FactChecker

    Science Advisor
    Gold Member
    2017 Award

    CORRECTION: This post is wrong. Please ignore it.

    The population standard deviation, ##\sigma##, of the probability distribution does not change. However, the sample standard deviation that estimates ##\sigma## using the formula
    $$ S = \sqrt{ \frac {{\sum_{i=1}^N (x_i - \bar x)^2}}{N-1}} $$
    does decrease as N increases. It becomes a better estimator of ##\sigma##.
     
  5. Aug 8, 2018 #4
    But isn't the only difference between the two the fact that you divide by √N in one case and by √(N-1) in the other? That means for large N they behave pretty much the same. Or am I misunderstanding your point?
     
  6. Aug 8, 2018 #5
    As for your first question, I am referring to the standard deviation of N measurements of (say) the mass of an object, with an underlying normal probability distribution understood.

    As for the variance, why would it behave any differently than the standard deviation? I mean, if the variance stays constant as N increases, wouldn't the standard deviation as well?
     
  7. Aug 8, 2018 #6

    FactChecker

    Science Advisor
    Gold Member
    2017 Award

    I'm sorry, I got sloppy. My post was wrong. I would like to delete it.
     
  8. Aug 8, 2018 #7
    That's cool. You should be able to, or at least edit out the text.
     
  9. Aug 8, 2018 #8

    StoneTemplePython

    Science Advisor
    Gold Member

    Can you write it out mathematically? I think I know what you mean but it is not what you just said here.


    In this realm, people trip up on Jensen's Inequality and the Triangle Inequality a lot. Trying to preserve linearity is worth it. (Then just take the square root at the end.)

    caveat: I may decide your question is different from what I thought and decide std deviation is nice to work with directly -- rare, but it happens.
     
  10. Aug 8, 2018 #9
    Sure, here is the standard deviation:

    $$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$

    where ##N## is the sample size, ##x_i## is an individual measurement, and ##\langle x\rangle## is the mean of all measurements.
     
  11. Aug 8, 2018 #10
    As for Jensen's inequality and such, I'm not saying that the variance and the standard deviation would rise by the same amount as N increases, only that if one increases, the other must increase as well, and if one stays constant, the other does too.
     
  12. Aug 8, 2018 #11

    StoneTemplePython

    Science Advisor
    Gold Member

    This is getting closer but still doesn't mathematically state your problem. So you have a random variable ##X## with a variance called ##\sigma_X^2##. I think you are talking about sampling -- specifically ##n## iid trials -- and you want to estimate ##E\big[X\big]## and ##\sigma_X^2 = E\big[X^2\big] - E\big[X\big]^2##. Is that your goal?

    The idea here is you need to estimate the mean ##E\big[X\big]## and the second moment ##E\big[X^2\big]## or ##E\big[Y\big]## where ##Y = X^2## if you like. I assume both of these exist.

    As you get larger and larger samples, your estimates will concentrate about the mean. Pick a favorite limit law. As estimates concentrate about the mean, the 'distance' between the estimates and the correct value goes down. Hence the variance of your estimate (read: squared 2 norm of difference between estimates and true value) goes down.
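    To make "the variance of your estimate goes down" concrete for the simplest case -- the sample mean of ##n## iid trials -- use linearity:

    $$\operatorname{Var}\big(\bar{X}_n\big) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{\sigma_X^2}{n},$$

    so the standard deviation of your estimate of the mean falls like ##1/\sqrt{n}##, while ##\sigma_X^2## itself does not change.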

    I wouldn't worry about the divide by ##n## vs divide by ##n-1## issue here.
    - - - -
    It could be instructive to work through a fully fleshed problem, both the math and a simulation, involving coin tossing and estimating mean and variance. Since coin tosses have bounded (specifically 0 or 1 only in this case) results, you can get very sharp estimates on concentration about the mean via Chernoff bounds.
     
  13. Aug 8, 2018 #12
    I used the NORMINV() function in Excel to generate random numbers that are distributed normally about a mean. The distribution is continuous, unlike coin flips. As more and more numbers are generated, the standard deviation doesn't drop. Most sources say that as I increase the sample size, the estimates will concentrate closer to the mean, but that doesn't appear to be happening.
     
  14. Aug 8, 2018 #13

    StoneTemplePython

    Science Advisor
    Gold Member

    I think you're making a mistake (a) using Excel (hard for others to replicate, and not known for being good at simulations -- a free Excel add-on called PopTools is decent, though) and (b) starting with a normal distribution instead of something simpler, in particular coin tossing (and yes, that can be normal-approximated, but that's a different topic).

    Assuming the moments exist, these limit laws are ironclad, which tells me you're doing something wrong here, but I can't read minds. The point is that the variance/variation in your estimates comes down for bigger sample sizes. You might try a simulation with ##n=10## data points instead of ##n=500## and compare the mean and variance estimates, as well as the variance in those estimates, after running many trials. My gut tells me that you're not calculating things the way I would, but this is exactly the difference between pasting in a few lines of code for others to look at and working in Excel.

    - - - -
    it also occurs to me that the normal distribution may be too 'nice' and hence you can miss subtle differences. Again with coin tossing in mind, consider a very biased coin that has a value of 1 with probability ##p = 10^{-2}## and a value of 0 aka tails with probability ##1 - p##. Consider the variation in your estimates of the mean and variance of said coin when you run 10,000 trials with each trial having say ##100## tosses, vs running 10,000 trials with each trial having ##10,000## tosses in it.
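    A minimal sketch of that experiment (trial counts scaled down from 10,000 so it runs quickly; the specific numbers are only for illustration). The thing to look at is how much the per-trial estimates of the mean and variance spread out around their true values as the number of tosses per trial grows:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 1e-2                       # very biased coin: P(heads) = 0.01
n_trials = 2000                # scaled down from 10,000 for speed

for tosses in (100, 10_000):
    # each row is one trial consisting of `tosses` coin flips (True = heads)
    flips = rng.random((n_trials, tosses)) < p
    mean_est = flips.mean(axis=1)            # per-trial estimate of p
    var_est = flips.var(axis=1, ddof=1)      # per-trial estimate of p(1 - p)
    print(f"{tosses:6d} tosses/trial:"
          f"  spread of mean estimates = {mean_est.std(ddof=1):.5f}"
          f"  spread of variance estimates = {var_est.std(ddof=1):.5f}")
```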
     
  15. Aug 8, 2018 #14
    Thanks, I'll look into it.
     
  16. Aug 9, 2018 #15

    Stephen Tashi

    Science Advisor

    You're failing to cope with the complicated vocabulary of statistics.

    That is an estimator for the standard deviation of the mean of N measurements. Some also call it the "sample standard deviation". Others reserve the term "sample standard deviation" for the similar expression with N instead of N-1 in the denominator.

    The mean of a sample of N measurements is a random variable. It has a standard deviation (namely, the standard deviation of its probability distribution). That standard deviation is not a function of the values ##x_i## obtained in one particular sample.

    Which standard deviation are you talking about?

    If ##X## is a random variable with standard deviation ##\sigma_X## then taking 100 samples of ##X## does not change the standard deviation of ##X##, but the random variable ##Y## defined by the mean of 100 samples of ##X## has a smaller standard deviation than ##X##. Neither of these standard deviations is the same as an estimator of a standard deviation. An estimator of a standard deviation is itself a random variable. It isn't a constant parameter associated with a probability distribution.
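    Concretely, for the mean of 100 independent samples,

    $$\sigma_Y = \frac{\sigma_X}{\sqrt{100}} = \frac{\sigma_X}{10},$$

    which is smaller than ##\sigma_X## even though the standard deviation of ##X## itself never changed.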
     
  17. Aug 9, 2018 #16
    Interesting. I appreciate the feedback. So let me ask, how would YOU define the standard deviation?
     
  18. Aug 9, 2018 #17

    Stephen Tashi

    Science Advisor

    That's like asking "How do you find the place?". It isn't a specific question.

    The use of "the" in the phase "the standard deviation" suggests that it can only refer to a single thing. That is not the case. The phrase "the standard deviation" is unspecific. As @StoneTemplePython said in post #2, standard deviation of what exactly?

    Terms like "standad deviation", "average", "mean" etc. have at least 3 possible meanings

    1. They may refer to a specific number obtained in a specific sample, e.g. "The mean weight in the sample of 5 apples was 0.2 kg."
    2. They may refer to a specific number that gives the value of a parameter associated with a probability distribution, e.g. "We assume the population of apples has a normal distribution with mean 0.2."
    3. They may refer to a random variable, e.g. "The distribution of the mean of a sample of 5 apples taken from a population of apples with mean 0.2 also has a mean of 0.2." (So we can speak of "the mean of a mean", "the standard deviation of a mean", "the standard deviation of a standard deviation", etc.)
     
  19. Aug 9, 2018 #18
    When you wrote "That is an estimator for the standard deviation of the mean of N measurements," what definition were you using?
     
  20. Aug 9, 2018 #19

    Stephen Tashi

    Science Advisor

    An estimator is a function of the values obtained in a sample. For example,
    $$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$
    is an estimator because it depends on the values ##x_i## obtained in a sample.

    Since the practical use of estimators is to estimate the parameters of probability distributions, we can speak of an "estimator of the standard deviation", an "estimator of the variance", etc.

    As you know, the mean of a sample of N things chosen from a population need not be exactly equal to the mean of the population. The mean of the sample is used as an estimator of the mean of the population. Likewise, we can define estimators for the standard deviation and variance.

    A complicated question in statistics is deciding which formula is the "best" estimator of a population parameter. More vocabulary is needed in order to be specific about what is meant by "best": there are "unbiased estimators", "minimum variance estimators", "consistent estimators", and "maximum likelihood estimators".

    Furthermore, an estimator is itself a random variable because it depends on the random values that occur in a sample. So an estimator has a mean, variance, standard deviation, etc., just like other random variables.
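    A small illustration of that last point (normal data and made-up numbers, chosen only for convenience): computing the estimator on many independent samples of size ##N## produces a spread of values. That spread shrinks as ##N## grows, while the values themselves stay centered near ##\sigma## rather than trending toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0                     # true population standard deviation (placeholder)
n_repeats = 5000                # independent samples drawn for each N

for N in (5, 25, 100, 1000):
    samples = rng.normal(0.0, sigma, size=(n_repeats, N))
    S = samples.std(axis=1, ddof=1)      # the estimator, computed once per sample
    print(f"N={N:4d}  mean of S = {S.mean():.3f}  std dev of S = {S.std(ddof=1):.3f}")
```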
     
  21. Aug 9, 2018 #20
    Okay, let's go back to the equation:

    $$\sigma = \sqrt{ \frac{\sum_{i=1}^N (x_i-\langle x\rangle)^2}{N-1} }$$

    Let's not even use the term "standard deviation." Here, N is the number of samples selected from a population. The average in the equation refers to the sample mean. The ##x_i##'s are generated from a process modeled by a normal distribution.

    What likely happens to ##\sigma## in the above equation as we increase N, that is, we select more and more samples from the population? Yes, the sample mean changes in value. Understood. If it cannot be said one way or the other (that is, it could go up or down), I'm good with that.
     