Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

'Reverse' derive correct dataset when given mean + std dev

  1. Jun 17, 2015 #1
    If I know the mean and standard deviation, but I don't have an original data set, is it possible to reverse calculate a data set that has this known mean and std dev? Is there an excel function I could use to do this easily? I know there could be multiple correct datasets and that's fine, I'm just wondering if it's possible to reverse derive a correct one that has the known mean and std dev.

    So, for a MEAN = 100 and STD. DEV= 15, can anyone tell me how to calculate a data set with n=3?

    Many thanks.
     
  2. jcsd
  3. Jun 17, 2015 #2

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    Yes, but there are often other properties that you want your data set to have (always positive, from a uniform, normal, or Poison process, etc.) So you usually want more than that. The RAND() EXCEL function gives a number from a uniform distribution between 0 and 1. It has a mean of 0.5 and a variance of 1/12. So multiply and add to get your desired SD and mean. To get a number from a normal distribution, combine RAND and NORMINV ( NORMINV(RAND(), desired_mean, desired_sd) )
     
  4. Jun 17, 2015 #3
    Thanks for your reply. I used " NORMINV(RAND(), 100, 15) " in Excel to generate a set of random numbers that look like they would individually fall within a normal distribution for a mean of 100. However, these values, when combined, do not have a mean of 100 (eg. 80, 93, 102). Is there a way I could perhaps use Excel to modify this data set to force it to have a mean of 100 and std dev of 15? Apologies if I am missing something in your answer.
     
  5. Jun 17, 2015 #4

    HallsofIvy

    User Avatar
    Staff Emeritus
    Science Advisor

    Call your three data points x, y, and z. Then the mean is (x+ y+ z)/3= 100 so x+ y+ z= 300. The variance is (x- 100)^2+ (y- 100)^2+ (z- 100)^2= x^2+ y^2+ z^2- 200(x+ y+ z)+ 30000= 225. That gives two equations to solve for the three unknown values. There are infinite number of such collections of data points. If you assume a normal distribution, which depends only on the two parameters, mean and standard deviation, then, of course, you can do this.
     
  6. Jun 17, 2015 #5
    Yes, you could scale it and add an offset (two independent variables) to satisfy the two constraints. But note that the expected value of the sd of a sample is smaller than the sd of the population. For a sample size of 3 a correction factor of approximately ## \sqrt{\frac 23} ## is appropriate if I remember correctly.
     
  7. Jun 17, 2015 #6
    Thanks for your reply.

    I used NORMINV(RAND(), 100, 15) in Excel to generate 80, 93, 102. Could you please show me how to scale and add a suitable offset as I'm kind of lost with this?

    Huge thanks.
     
  8. Jun 17, 2015 #7
    Use the formulas for the mean and standard deviation. So if you scale x, y and z by k and add an offset a, the mean is ## \frac{kx+a + ky+a + kz+a}3 = \frac k3 (x+y+z) + a ##. Use the appropriate formula for the sd as well and substitute in your numerical values and you will have two simultaneous equations that are easily solved for k and a. I'm not sure the end result has any statistical merit though.
     
  9. Jun 17, 2015 #8

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    Oh. I misunderstood.
    1) Calculate your sample mean sm = (80+93+102)/3 = 91.6666666666667 and sample standard deviation ssd;
    2) Change each data point Xi to X'i = (15/ssd)(Xi - sm) + 100

    That should have the correct sample mean and standard deviation.
    Note: the standard deviation of the data will have the specified sample standard deviation. If you want to match the true standard deviation, multiply ssd by ((n-1)/n)0.5, where n is the number of data points. It is not clear to me which value would be more appropriate.
     
  10. Jun 17, 2015 #9
    Are you sure this is an accurate correction factor for n = 3?
     
  11. Jun 17, 2015 #10

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    No, I was wrong. Thanks for the correction. The standard equation for the sample variance (dividing by n-1) is an unbiased estimator for the distribution variance. Taking the square root introduces a bias, so the standard equation for the sample standard deviation (still dividing by n-1) is a biased estimator. But it underestimates σ, so multiplying it by ((n-1)/n)1/2 would make it even worse. (see https://en.wikipedia.org/?title=Unbiased_estimation_of_standard_deviation )
     
  12. Jun 18, 2015 #11
    Thanks for you help guys, now I understand
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook




Similar Discussions: 'Reverse' derive correct dataset when given mean + std dev
  1. Probability without STD? (Replies: 13)

Loading...