Calculating errors and trendline - different correlation coefficients?

1. Oct 26, 2013

chipotleaway

Background:
My lab group has taken a number of measurements of gas levels over 15 minute period every 15 seconds. We actually used a gas sensor and computer interface to do it so it pretty much did all the work. In calculating the errors for the data, using the uncertainties given in the user manuals of device, we got unreasonably large errors as the uncertainties on the devices are rather large (20%). So it was suggested that we use linear regression (?) to calculate the trendline for the data and the corresponding errors.

The formulas I'm using to find the trendline and errors are from *, and it involves equations (3) and (7) from the paper.
*http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
These give the respective intercept a and slope b. Then I used equation (15) to find the correlation coefficient:

$$\r=b\frac{S_x}{S_y}$$

b is the value found from the big formula in (7), $S_x$ and $S_y$ are the standard deviations in the dependent and independent data sets.

This was needed to calculate the standard error using formula (17), but I ran into some trouble because formula (15) gave a correlation coefficient of larger than |1| for some of the data sets, and when you plug this into (17), you get a negative square root.

So I did a quick search and found an alternative formula from http://www.ditutor.com/regression/correlation_coefficient.html
$$r=\frac{cov(x,y)}{S_xS_y}$$
which gave correlation coefficents of less than |1| for all the data sets.

But the thing is, for some of the data sets, both formulas work and the latter gives a much small errors than the former - although both do give very small errors (you can't see them on the graph). For now, I'm using the one which gives the bigger error, but which should I use and why doesn't the correlation coefficient formula in the paper always work?

Also, the trendline calculated using the formula from the paper is also quite poor - there is a very clear linear trend in the data points and the one that I calculated using the formula is a pretty bad fit. Can I just use apply the errors I calculated to my data points and not use the trendline?
I'm not sure I can do that because the paper says that the standard error is in the slope of the regression line, not for the actual data points.

By the way, in case there is a difference between a 'trendline' and a 'regression line' - I mean the same thing in this post as I'm not aware of what the differences are...my apologies to anyone who gets offended by this!

Thanks

2. Oct 27, 2013

Stephen Tashi

Is there an argument based on physics that says the gas level should change linearly with time?

I haven't checked formula 15 for r in the paper, but a correlation coefficient should never be greater than 1.

3. Oct 28, 2013

chipotleaway

We were measuring changes in CO2 and O2 levels in a closed container containing some plants. I don't know but that's we measured.

Yeah..that's why I had to use the second formula in the OP for some of the data sets but for some, both formulas work. The one in the paper gives a bigger error where it works but I'm not sure if it's correct at all

4. Oct 28, 2013

Stephen Tashi

My guess is that you are using one of the formulas incorrectly. Can you find a simple example where the two methods give different values of the correlation coefficient?

5. Oct 29, 2013

chipotleaway

The two formulas give different correlation coefficients for all the data sets, it's most are bigger than |1| (for the first formula).

Here's one where the correlation coefficient is 'acceptable'.

Standard deviation in x = 4.438234
Standard deviation in y = 211.1072

First formula:

b-value = 38.53639
Correlation coefficient = 0.810174

Second formula:
Covariance = 931.2381
Correlation coefficient = 0.993911

And here's an example where the correlation coefficient using the first formula is too big
Standard deviation in x = 4.438234
Standard deviation in y = 741.0841

First formula:
b-value = -1429.57
Correlation coefficient = -8.56147

Second formula:
Covariance = -2824.48
Correlation coefficient = -0.85874

I'm doing all calculations in Excel so I've attached the file incase you want to have a look but I've gone over the formulas several times and can't find any errors. (the correlation coefficient that is too big is the O2 data, the first one in the file).

Attached Files:

• data1.xlsx
File size:
35.2 KB
Views:
51
6. Oct 29, 2013

Stephen Tashi

I don't understand how you got b = -1429.5708415884 from the values var(X) = (4.4382335075
)^2 and cov(X,Y) = -2824.4800243686

7. Oct 29, 2013

chipotleaway

I was told to use equation (7) (the more convoluted formula for b) but I just did a quick test using $b=\frac{Cov(X,Y)}{Var(x)}$ (which I didn't notice until you mentioend var(X)) and the b-value it gives gives the same correlation coefficient as the second.

The slope of the regression/trendline is also much better fit now, except that the constant a is still quite high the line is translated up and does not lie over the data points. Probably another error in there somewhere I'll have to go through it again.

But thanks a lot!

8. Oct 31, 2013

chipotleaway

I cant find anymore errors but when finding the averages in the data points, dividing by n+1 rather than n makes the regression line line up with the data really nicely, so that fixes all the problems so far but I don't why it works.