'Reverse' derive correct dataset when given mean + std dev

In summary: Xi to X'i = (15/ssd)(Xi - sm) + 100That should have the correct sample mean and standard deviation.Note: the standard deviation of the data will have the specified sample standard deviation.
  • #1
fieldmusic123
7
0
If I know the mean and standard deviation, but I don't have an original data set, is it possible to reverse calculate a data set that has this known mean and std dev? Is there an excel function I could use to do this easily? I know there could be multiple correct datasets and that's fine, I'm just wondering if it's possible to reverse derive a correct one that has the known mean and std dev.

So, for a MEAN = 100 and STD. DEV= 15, can anyone tell me how to calculate a data set with n=3?

Many thanks.
 
Physics news on Phys.org
  • #2
Yes, but there are often other properties that you want your data set to have (always positive, from a uniform, normal, or Poison process, etc.) So you usually want more than that. The RAND() EXCEL function gives a number from a uniform distribution between 0 and 1. It has a mean of 0.5 and a variance of 1/12. So multiply and add to get your desired SD and mean. To get a number from a normal distribution, combine RAND and NORMINV ( NORMINV(RAND(), desired_mean, desired_sd) )
 
  • Like
Likes fieldmusic123
  • #3
FactChecker said:
Yes, but there are often other properties that you want your data set to have (always positive, from a uniform, normal, or Poison process, etc.) So you usually want more than that. The RAND() EXCEL function gives a number from a uniform distribution between 0 and 1. It has a mean of 0.5 and a variance of 1/12. So multiply and add to get your desired SD and mean. To get a number from a normal distribution, combine RAND and NORMINV ( NORMINV(RAND(), desired_mean, desired_sd) )

Thanks for your reply. I used " NORMINV(RAND(), 100, 15) " in Excel to generate a set of random numbers that look like they would individually fall within a normal distribution for a mean of 100. However, these values, when combined, do not have a mean of 100 (eg. 80, 93, 102). Is there a way I could perhaps use Excel to modify this data set to force it to have a mean of 100 and std dev of 15? Apologies if I am missing something in your answer.
 
  • #4
Call your three data points x, y, and z. Then the mean is (x+ y+ z)/3= 100 so x+ y+ z= 300. The variance is (x- 100)^2+ (y- 100)^2+ (z- 100)^2= x^2+ y^2+ z^2- 200(x+ y+ z)+ 30000= 225. That gives two equations to solve for the three unknown values. There are infinite number of such collections of data points. If you assume a normal distribution, which depends only on the two parameters, mean and standard deviation, then, of course, you can do this.
 
  • Like
Likes fieldmusic123
  • #5
fieldmusic123 said:
Is there a way I could perhaps use Excel to modify this data set to force it to have a mean of 100 and std dev of 15?
Yes, you could scale it and add an offset (two independent variables) to satisfy the two constraints. But note that the expected value of the sd of a sample is smaller than the sd of the population. For a sample size of 3 a correction factor of approximately ## \sqrt{\frac 23} ## is appropriate if I remember correctly.
 
  • Like
Likes fieldmusic123
  • #6
MrAnchovy said:
Yes, you could scale it and add an offset (two independent variables) to satisfy the two constraints. But note that the expected value of the sd of a sample is smaller than the sd of the population. For a sample size of 3 a correction factor of approximately ## \sqrt{\frac 23} ## is appropriate if I remember correctly.

Thanks for your reply.

I used NORMINV(RAND(), 100, 15) in Excel to generate 80, 93, 102. Could you please show me how to scale and add a suitable offset as I'm kind of lost with this?

Huge thanks.
 
  • #7
Use the formulas for the mean and standard deviation. So if you scale x, y and z by k and add an offset a, the mean is ## \frac{kx+a + ky+a + kz+a}3 = \frac k3 (x+y+z) + a ##. Use the appropriate formula for the sd as well and substitute in your numerical values and you will have two simultaneous equations that are easily solved for k and a. I'm not sure the end result has any statistical merit though.
 
  • Like
Likes fieldmusic123
  • #8
fieldmusic123 said:
I used NORMINV(RAND(), 100, 15) in Excel to generate 80, 93, 102. Could you please show me how to scale and add a suitable offset as I'm kind of lost with this?
Oh. I misunderstood.
1) Calculate your sample mean sm = (80+93+102)/3 = 91.6666666666667 and sample standard deviation ssd;
2) Change each data point Xi to X'i = (15/ssd)(Xi - sm) + 100

That should have the correct sample mean and standard deviation.
Note: the standard deviation of the data will have the specified sample standard deviation. If you want to match the true standard deviation, multiply ssd by ((n-1)/n)0.5, where n is the number of data points. It is not clear to me which value would be more appropriate.
 
  • Like
Likes fieldmusic123
  • #9
FactChecker said:
If you want to match the true standard deviation, multiply ssd by ((n-1)/n)0.5, where n is the number of data points.
Are you sure this is an accurate correction factor for n = 3?
 
  • Like
Likes fieldmusic123 and FactChecker
  • #10
MrAnchovy said:
Are you sure this is an accurate correction factor for n = 3?
No, I was wrong. Thanks for the correction. The standard equation for the sample variance (dividing by n-1) is an unbiased estimator for the distribution variance. Taking the square root introduces a bias, so the standard equation for the sample standard deviation (still dividing by n-1) is a biased estimator. But it underestimates σ, so multiplying it by ((n-1)/n)1/2 would make it even worse. (see https://en.wikipedia.org/?title=Unbiased_estimation_of_standard_deviation )
 
  • Like
Likes fieldmusic123
  • #11
Thanks for you help guys, now I understand
 

1. How do you reverse derive a correct dataset when given the mean and standard deviation?

In order to reverse derive a correct dataset from the mean and standard deviation, you will need to use the formula for calculating the standard deviation. Once you have the standard deviation, you can then use it to find the range of values that fall within one, two, or three standard deviations from the mean. From there, you can generate a dataset that fits those parameters.

2. Can you reverse derive a dataset if you are only given the mean and standard deviation?

Yes, it is possible to reverse derive a dataset if you are given the mean and standard deviation. As mentioned in the previous answer, you will need to use the standard deviation to determine the range of values that fit within the given parameters. However, it is important to note that there may be multiple datasets that fit the same mean and standard deviation values, so the reverse derived dataset may not be the only possible solution.

3. What other information do you need besides the mean and standard deviation to reverse derive a dataset?

In addition to the mean and standard deviation, you will also need to know the distribution of the dataset. This can include the type of distribution (normal, uniform, etc.), the skewness of the data, and any other relevant information that can help determine the range of values that fit within the given parameters.

4. What are some potential challenges or limitations when reverse deriving a dataset from the mean and standard deviation?

One potential challenge when reverse deriving a dataset is that there may be multiple datasets that fit the same mean and standard deviation values. This can make it difficult to determine the exact dataset that was originally used. Additionally, if the distribution of the dataset is unknown or not specified, it can be challenging to accurately determine the range of values that fit within the given parameters.

5. Are there any tools or software programs that can assist with reverse deriving a dataset?

Yes, there are various statistical software programs that have functions or tools specifically designed for reverse deriving datasets. These programs can help calculate the range of values based on the mean and standard deviation, as well as provide visual representations of the data. However, it is important to understand the underlying principles and formulas used in order to accurately interpret and use these tools.

Similar threads

  • Set Theory, Logic, Probability, Statistics
Replies
29
Views
2K
Replies
5
Views
3K
  • Set Theory, Logic, Probability, Statistics
Replies
4
Views
871
  • Set Theory, Logic, Probability, Statistics
Replies
6
Views
3K
  • Calculus and Beyond Homework Help
Replies
1
Views
782
  • Calculus and Beyond Homework Help
Replies
3
Views
880
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
780
  • Set Theory, Logic, Probability, Statistics
Replies
6
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
698
Back
Top