# The error between two different standard deviations

1. Aug 18, 2015

Hi,

I have two sets (f and y) of 1000 data points each, and each data point in one set corresponds to one in the other. Essentially, I wanted to compute the standard deviation between the two sets, and I did this:

$\sigma_1 =$ IMAGE 1 (check attachment)

$\sigma_2 =$ IMAGE 2 (check attachment)

$\Delta H$ is simply $f(x_i) - y(x_i)$. This gives me a new set of the differences, and $\bar{H}$ is the average of this new set. As you can see, one computation is the standard deviation between the two sets, while the other computation is the standard deviation of the differences between the two sets.

Now, I am simply wondering if it's possible to know what the error between $\sigma_1$ and $\sigma_2$ is without having to check manually. My apologies if this sounds like an odd inquiry or a very vague request, but these two variables seem like two very different quantities, and I am just a little uncertain on how different the two really are. I've done the two computations for a set of data (with a nearly Gaussian distribution) of my own using this method and my answers are nearly identical (off by 0.003) but I'm fairly sure this is dependent on the data itself.

Any advice is welcome!

2. Aug 18, 2015

### Staff: Mentor

What do you mean by "error between $\sigma_1$ and $\sigma_2$"?

3. Aug 18, 2015

### h6ss

What do you mean by that?

I don't understand what you're looking for.

If you're looking to compare both sets, you usually shouldn't work with the difference of both data sets per value. What you want is maybe the difference between a function (for example the mean) of one and the same function of the other, and then look for the standard error of the mean difference, for example. You also need to tell us what you are talking about when you say you want to measure the "error between both standard deviations".

Last edited: Aug 18, 2015
4. Aug 18, 2015

Sorry for being very vague earlier. I was in a bit of a rush, but hopefully I can expand now. It really is a bit of an odd inquiry, though. I'm fairly sure the answer depends on the entries in my sets, but I am wondering if there's anything I am missing here.

I have two sets of data (f and y), each consisting of 1000 entries. So I find the standard deviation like this:

$\sigma_1 = \sqrt {\frac {\sum^{N=1000}_{i=1} (f_i - y_i)^2}{N-1}}$

This seems fairly standard and is simply the standard deviation between the two sets. But I also computed a second quantity by first computing:

$\Delta H_i = f_i - y_i$, which gave me a new set $H$ of 1000 entries consisting of the differences between f and y (i.e. $\Delta H_i$ for i = 1 to 1000). Then I found the average of this new set, $\bar{H}$, which is just $\frac {\sum^{N=1000}_{i=1} \Delta H_i }{N}$. The second quantity is then:

$\sigma_2 = \sqrt {\frac {\sum^{N=1000}_{i=1} (\Delta H_i - \bar{H})^2}{N-1}}$

I guess I'm just wondering if there's a way to understand how different these two measures are for any given sets. I am likely just projecting my own biased interpretation, but these two quantities seem closely related, since one is a measure of the spread between the two values (f and y), while the other is a measure of the spread of the difference of the two values. I feel like I'm still being vague, but I'm curious to see whether $\sigma_1$ and $\sigma_2$ are actually similar for any given sets, possibly depending on N.
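For anyone who wants to reproduce the comparison, here is a minimal Python sketch of the two quantities as defined above. The function names and the generated data are hypothetical stand-ins (the real f and y are not shown in this thread), so treat it as an illustration rather than the exact computation I ran.

```python
import math
import random


def rms_difference(f, y):
    """sigma_1 as defined above: sqrt(sum((f_i - y_i)^2) / (N - 1))."""
    n = len(f)
    return math.sqrt(sum((fi - yi) ** 2 for fi, yi in zip(f, y)) / (n - 1))


def std_of_differences(f, y):
    """sigma_2 as defined above: sample standard deviation of H_i = f_i - y_i."""
    diffs = [fi - yi for fi, yi in zip(f, y)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    return math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))


# Hypothetical paired data: y plays the role of the measured values and
# f the role of the model, offset by a small bias plus Gaussian noise.
rng = random.Random(0)
y = [rng.gauss(10.0, 2.0) for _ in range(1000)]
f = [yi + rng.gauss(0.05, 0.5) for yi in y]

sigma_1 = rms_difference(f, y)
sigma_2 = std_of_differences(f, y)
print(sigma_1, sigma_2)
```

With a small mean difference like the one assumed here, the two numbers come out close to each other; a larger bias between f and y pulls them apart.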

5. Aug 18, 2015

### Hornbein

sigma_2 is the standard deviation of the differences of the paired values. sigma_1 seems to me to be nothing meaningful. The formula you have used to calculate it is not the formula for standard deviation.

6. Aug 18, 2015

Ahh yes! I should definitely expand and please correct me if something seems wrong. Also, any advice is once again welcome.

I essentially have a scatter plot of values, with $y(x_i)$ being the data points placed on the x-y axes. I fitted a function, f, to these data points to create a model for the distribution of y, and I wanted to find how much my model deviates from the true value (y). In such a case, wouldn't $\sigma_1 = \sqrt {\frac {\sum^{N=1000}_{i=1} (f(x_i) - y(x_i))^2}{N-1}}$ be an appropriate measure of deviation between my model and the actual data? If not, any particular explanation would be very helpful. If so, then I guess I'm just trying to figure out whether $\sigma_1$ or $\sigma_2$ is the better measure, or whether both are still perfectly valid and comparable in most cases.

7. Aug 18, 2015

### h6ss

The standard deviation between the two sets makes no sense. No such thing exists. Where did you get this formula from? As I said earlier, you can't just use the difference of the values for both datasets and label it as the standard deviation "between" them. Each dataset has a standard deviation within itself, dealt with individually.

We have the f-set's and the y-set's standard deviations calculated with

$\sigma_f = \sqrt{\frac{1}{1000}\sum_{i = 1}^{1000} (f_i - \mu_f)^2}$ and $\sigma_y = \sqrt{\frac{1}{1000}\sum_{i = 1}^{1000} (y_i - \mu_y)^2}$,

where $\mu_f$ and $\mu_y$ are the respective means for both sets.

This is why I don't understand the $(f_i-y_i)$ part in your formula. The standard deviation that you calculate for $\Delta H$ sounds right, but since $\Delta H_i=f_i-y_i$, in the second standard deviation you're basically just using the term $\Delta H_i-\bar{H}=f_i-y_i-\bar{H}$, and I don't really see what that is useful for.

If your goal is to compare both datasets and see if there's a significant difference between them, maybe you should measure the spread of each dataset individually by finding their respective standard deviations and then test for the difference of their standard deviations. Otherwise I don't see the motivation behind comparing the two "formulas" you've stated.

8. Aug 18, 2015

### Hornbein

Aha. Yes, sigma_1 is the better measure. It's not a standard deviation. I don't know what to call it anymore. Sum of squares of the differences, I guess.

I wouldn't call y the true value, I'd call it the measured value. The true value is unknown due to measurement error.

sigma_2 doesn't seem all that useful to me. It will always be less than or equal to sigma_1. It seems to me that there is no reason to subtract H bar. It is a fairly meaningless random variable, I would think, other than telling you whether your function tends to give you a value that is higher or lower than the measured value.

9. Aug 18, 2015

### Hornbein

The key phrase you are looking for is "goodness of fit." sigma_1 is a statistic used to measure goodness of fit.

10. Aug 18, 2015

Yep. Those are all errors on my part. That would be an interesting idea to compare the two standard deviations, but is there no better measure? Is my quantity denoted by $\sigma_1$ not at least descriptive of the difference between these two sets?

I guess my main goal has been to simply calculate the deviation between my measured (y) and estimated (f) values, and I erroneously considered the square root of the sum of the squares to be a standard deviation for some odd reason. I also assumed $\sigma_2$ to be a good measure (if its mean, $\bar{H}$, equals 0). I guess it is "a" measure, but I'm a little unsure about what exactly it could appropriately be named (as Hornbein stated).

11. Aug 18, 2015

Thank you! Is it unreasonable to consider this goodness of fit value as the error in my expected and measured value?

To reiterate: would it be correct to say that $\sigma_1$ is a measure of the error in approximating y as f? While $\sigma_2$ is the standard deviation of the difference between f and y?

12. Aug 18, 2015

### Hornbein

Sure. Your statistic is well known as the "least squares" metric. It is standard. Just say "I'm using least squares." The smaller the sum of the squares of the differences, the better the fit. (You needn't bother taking the square root, though it seems harmless to me.)

You could also call it "nonlinear regression." That doesn't mean a whole lot, but that is what it is called if f(x) is not a linear function.

sigma_2 is the standard deviation, but to me it doesn't seem all that meaningful or useful.
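To make the "smaller sum of squares means better fit" idea concrete, here is a small Python sketch that compares two candidate models against the same measured values. The data, the two models, and the function name are all made up for illustration; they are not from the original poster's problem.

```python
import math


def residual_sum_of_squares(f, y):
    """Sum of squared differences between model values f and measured values y."""
    return sum((fi - yi) ** 2 for fi, yi in zip(f, y))


# Hypothetical measured values y at points x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 4.2, 8.8, 16.1]

# Two hypothetical candidate models evaluated at the same x values.
model_a = [xi ** 2 for xi in x]          # f(x) = x^2
model_b = [4.0 * xi - 2.0 for xi in x]   # f(x) = 4x - 2

rss_a = residual_sum_of_squares(model_a, y)
rss_b = residual_sum_of_squares(model_b, y)

# The model with the smaller sum of squares is the better fit.
best = "a" if rss_a < rss_b else "b"
print(rss_a, rss_b, best)
```

Taking the square root, or dividing by N - 1 first, does not change which model wins, because both operations preserve the ordering of non-negative numbers.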

13. Aug 19, 2015

### BWV

The standard deviation of the difference between two data sets makes sense only if they are somehow related, such as being ordered in time. For example, if f and y represent the output of two sensors read at the same points in time, then the standard deviation of their difference is of interest. If there is no common ordering of the sets, then the measure is meaningless.

14. Aug 19, 2015

### h6ss

Not really, but maybe you'll find more information about what you're looking for here: https://en.wikipedia.org/wiki/Residual_sum_of_squares

That is correct, but again, be careful with how you interpret this information.

15. Aug 19, 2015

Yes, they are ordered and represent a simultaneous reading.

16. Aug 19, 2015

### BWV

The difference between sigma_1 and sigma_2 in the OP is:

Sig_1 is the Stdev of the difference in readings at each point in time

Sig_2 is the same thing but 'whitened' by extracting the mean. If the mean error is zero then the two measures are identical. This is a common transformation when this data is used for additional or algorithms.

17. Sep 8, 2015

### gill1109

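Since variance = mean of square minus square of mean = mean of squared difference from mean, the difference between sigma1 squared and sigma2 squared is the square of the mean difference between the two sets of observations (times a factor N/(N-1) if both are computed with the N-1 denominator used earlier in the thread). The first one (sigma1 squared) is the larger of the two, with equality only when the mean difference is zero.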
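As a sketch of why this holds, using the thread's notation $\Delta H_i = f_i - y_i$ and $\bar{H} = \frac{1}{N}\sum_{i=1}^{N} \Delta H_i$, and assuming both quantities use the $N-1$ denominator from post 4, expand the sum inside $\sigma_2$:

$\sum_{i=1}^{N} (\Delta H_i - \bar{H})^2 = \sum_{i=1}^{N} \Delta H_i^2 - 2\bar{H}\sum_{i=1}^{N} \Delta H_i + N\bar{H}^2 = \sum_{i=1}^{N} \Delta H_i^2 - N\bar{H}^2$

Dividing by $N-1$ and rearranging gives

$\sigma_1^2 - \sigma_2^2 = \frac{N}{N-1}\,\bar{H}^2 \geq 0$

so $\sigma_2 \leq \sigma_1$ always, and the two agree exactly when $\bar{H} = 0$. For $N = 1000$ the factor $\frac{N}{N-1}$ is about 1.001, so the gap between the two squared quantities is essentially the squared mean difference.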