# 'Reverse' derive correct dataset when given mean + std dev

Tags:
1. Jun 17, 2015

### fieldmusic123

If I know the mean and standard deviation, but I don't have an original data set, is it possible to reverse calculate a data set that has this known mean and std dev? Is there an excel function I could use to do this easily? I know there could be multiple correct datasets and that's fine, I'm just wondering if it's possible to reverse derive a correct one that has the known mean and std dev.

So, for a MEAN = 100 and STD. DEV= 15, can anyone tell me how to calculate a data set with n=3?

Many thanks.

2. Jun 17, 2015

### FactChecker

Yes, but there are often other properties that you want your data set to have (always positive, from a uniform, normal, or Poison process, etc.) So you usually want more than that. The RAND() EXCEL function gives a number from a uniform distribution between 0 and 1. It has a mean of 0.5 and a variance of 1/12. So multiply and add to get your desired SD and mean. To get a number from a normal distribution, combine RAND and NORMINV ( NORMINV(RAND(), desired_mean, desired_sd) )

3. Jun 17, 2015

### fieldmusic123

Thanks for your reply. I used " NORMINV(RAND(), 100, 15) " in Excel to generate a set of random numbers that look like they would individually fall within a normal distribution for a mean of 100. However, these values, when combined, do not have a mean of 100 (eg. 80, 93, 102). Is there a way I could perhaps use Excel to modify this data set to force it to have a mean of 100 and std dev of 15? Apologies if I am missing something in your answer.

4. Jun 17, 2015

### HallsofIvy

Staff Emeritus
Call your three data points x, y, and z. Then the mean is (x+ y+ z)/3= 100 so x+ y+ z= 300. The variance is (x- 100)^2+ (y- 100)^2+ (z- 100)^2= x^2+ y^2+ z^2- 200(x+ y+ z)+ 30000= 225. That gives two equations to solve for the three unknown values. There are infinite number of such collections of data points. If you assume a normal distribution, which depends only on the two parameters, mean and standard deviation, then, of course, you can do this.

5. Jun 17, 2015

### MrAnchovy

Yes, you could scale it and add an offset (two independent variables) to satisfy the two constraints. But note that the expected value of the sd of a sample is smaller than the sd of the population. For a sample size of 3 a correction factor of approximately $\sqrt{\frac 23}$ is appropriate if I remember correctly.

6. Jun 17, 2015

### fieldmusic123

I used NORMINV(RAND(), 100, 15) in Excel to generate 80, 93, 102. Could you please show me how to scale and add a suitable offset as I'm kind of lost with this?

Huge thanks.

7. Jun 17, 2015

### MrAnchovy

Use the formulas for the mean and standard deviation. So if you scale x, y and z by k and add an offset a, the mean is $\frac{kx+a + ky+a + kz+a}3 = \frac k3 (x+y+z) + a$. Use the appropriate formula for the sd as well and substitute in your numerical values and you will have two simultaneous equations that are easily solved for k and a. I'm not sure the end result has any statistical merit though.

8. Jun 17, 2015

### FactChecker

Oh. I misunderstood.
1) Calculate your sample mean sm = (80+93+102)/3 = 91.6666666666667 and sample standard deviation ssd;
2) Change each data point Xi to X'i = (15/ssd)(Xi - sm) + 100

That should have the correct sample mean and standard deviation.
Note: the standard deviation of the data will have the specified sample standard deviation. If you want to match the true standard deviation, multiply ssd by ((n-1)/n)0.5, where n is the number of data points. It is not clear to me which value would be more appropriate.

9. Jun 17, 2015

### MrAnchovy

Are you sure this is an accurate correction factor for n = 3?

10. Jun 17, 2015

### FactChecker

No, I was wrong. Thanks for the correction. The standard equation for the sample variance (dividing by n-1) is an unbiased estimator for the distribution variance. Taking the square root introduces a bias, so the standard equation for the sample standard deviation (still dividing by n-1) is a biased estimator. But it underestimates σ, so multiplying it by ((n-1)/n)1/2 would make it even worse. (see https://en.wikipedia.org/?title=Unbiased_estimation_of_standard_deviation )

11. Jun 18, 2015

### fieldmusic123

Thanks for you help guys, now I understand