# Mean and Variance of a Data Set

1. Aug 27, 2016

### DeldotB

1. The problem statement, all variables and given/known data
In this problem we will be generating and analyzing lists of normally distributed random numbers. The distribution we are sampling has true mean 0 and standard deviation 1.

1. If we sample this distribution $N = 5$ times, what do we expect the mean to be? How about the standard deviation? What's the error on the mean?

2. Relevant equations

$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$

$s^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2$
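
For concreteness, here is a minimal Python sketch (NumPy assumed; the sample values are made up) of these two formulas applied to a hypothetical sample of $N = 5$ numbers, with the true mean $\mu = 0$ plugged into the second one:

```python
import numpy as np

x = np.array([0.4, -1.2, 0.7, 0.1, -0.3])  # hypothetical sample of N = 5 draws
N = len(x)
mu = 0.0  # the true mean, known for this distribution

xbar = x.sum() / N                       # sample mean
s2_known_mu = ((x - mu) ** 2).sum() / N  # variance formula using the true mean

print(f"xbar = {xbar:.3f}, s^2 (mu known) = {s2_known_mu:.3f}")
```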

3. The attempt at a solution

I'm not sure where to go here. What does it mean to have a true mean of zero? What is meant by "true" mean? I haven't seen this phrase used before. I read that if a data distribution is approximately normal, then about 68 percent of the data values lie within one standard deviation of the mean, but how does this help me when I want to sample this distribution? Any help would be appreciated! I have never taken a statistics class.

2. Aug 27, 2016

### Ray Vickson

Suppose one run of your experiment consists of taking a random sample of size $N = 5$ from a standard normal distribution (mean = 0, variance = 1). In any run of your experiment, the computed mean of your data set is $\bar{x} = \frac{1}{5}(x_1 + x_2 + x_3 + x_4 + x_5)$, where the $x_i$ constitute your sample of 5 numbers. Note that $\bar{x}$ is itself a sample point from a random variable $\bar{X}$: in one experiment it might be 1.7, in another it might be -0.83, and so on. So $\bar{X}$ itself has some true mean and some true variance; these would be well approximated by repeating the experiment 100,000 times and taking the average and sample variance of your 100,000 $\bar{x}$ values. Remember, however, that for any particular experiment the computed $\bar{x}$ and the computed sample variance $s^2(x)$ will very likely differ at least a bit from the true values of 0 and 1, respectively.
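
As a sanity check on that description, here is a minimal Python sketch (NumPy assumed; the seed is arbitrary) that repeats the $N = 5$ experiment 100,000 times and looks at the spread of the resulting $\bar{x}$ values. The variance of $\bar{X}$ should come out near $1/N = 0.2$, so the error on the mean is about $1/\sqrt{5} \approx 0.45$:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
N = 5
runs = 100_000

samples = rng.standard_normal((runs, N))  # each row is one run of the experiment
xbars = samples.mean(axis=1)              # one computed mean per run

print("average of the x-bars: ", xbars.mean())  # close to the true mean, 0
print("variance of the x-bars:", xbars.var())   # close to 1/N = 0.2
```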

$$s^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2$$
is correct only if you pretend you know $\mu$; it is NOT what we usually call the "sample variance". The usual definition of sample variance is that we estimate $\mu$ from the data as well, so we are dealing with
$$\text{sample variance} = \frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2,$$
$$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i .$$
Note that we divide by $N-1$ instead of $N$; the need for doing that arises because we have already "used up" one piece of information in computing $\bar{x}$, so only $N-1$ independent pieces of information are left for estimating the variance. Theoretically, the true mean of the random variable
$$S = \frac{1}{N-1} \sum_{i=1}^N (x_i -\bar{x})^2$$
is 1, which is the true value of the variance. Had we divided by $N$ instead, we would have a random variable with mean $(N-1)/N = 1 - 1/N$ rather than the true value 1. Of course, for large $N$ this makes hardly any noticeable difference.
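
To see this bias numerically, here is a minimal Python sketch (NumPy assumed, arbitrary seed) that averages both estimators over many runs: the $N-1$ version comes out near the true variance 1, while the $N$ version comes out near $(N-1)/N = 0.8$.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
N = 5
runs = 100_000

samples = rng.standard_normal((runs, N))
xbars = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbars) ** 2).sum(axis=1)  # sum of squared deviations, per run

print("divide by N-1:", (ss / (N - 1)).mean())  # close to the true variance, 1
print("divide by N:  ", (ss / N).mean())        # close to (N-1)/N = 0.8
```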