Calculating errors and trendline - different correlation coefficients?

Click For Summary

Discussion Overview

The discussion revolves around the calculation of errors and correlation coefficients in the context of measuring gas levels over time using a gas sensor. Participants explore the use of linear regression to derive trendlines and the associated uncertainties in their data. The conversation includes technical aspects of statistical formulas and their application to experimental data.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes using a gas sensor to measure gas levels and calculates errors based on device uncertainties, which are large (20%).
  • Another participant questions the validity of a correlation coefficient exceeding 1, suggesting that it should not occur.
  • Some participants note discrepancies between two formulas for calculating the correlation coefficient, with one formula yielding values greater than 1 and the other providing acceptable values.
  • One participant provides specific examples of correlation coefficients calculated using both formulas, highlighting significant differences in results.
  • There is a suggestion that one of the formulas may be used incorrectly, prompting a request for a simple example to illustrate the differences.
  • Another participant mentions that using the second formula for the slope (b) aligns better with the data, although the constant (a) remains problematic.
  • One participant finds that adjusting the method for calculating averages improves the fit of the regression line to the data points, but is unsure why this adjustment works.

Areas of Agreement / Disagreement

Participants express uncertainty regarding the correct application of formulas for correlation coefficients and the implications for their data analysis. There is no consensus on which formula is definitively correct, and multiple competing views on the reliability of the formulas remain unresolved.

Contextual Notes

Participants mention potential errors in calculations and the need for careful consideration of statistical methods. There are unresolved questions about the differences between trendlines and regression lines, as well as the impact of using different methods for calculating averages.

chipotleaway
Messages
174
Reaction score
0
Background:
My lab group has taken a number of measurements of gas levels over 15 minute period every 15 seconds. We actually used a gas sensor and computer interface to do it so it pretty much did all the work. In calculating the errors for the data, using the uncertainties given in the user manuals of device, we got unreasonably large errors as the uncertainties on the devices are rather large (20%). So it was suggested that we use linear regression (?) to calculate the trendline for the data and the corresponding errors.

The formulas I'm using to find the trendline and errors are from *, and it involves equations (3) and (7) from the paper.
*http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
These give the respective intercept a and slope b. Then I used equation (15) to find the correlation coefficient:

\r=b\frac{S_x}{S_y}

b is the value found from the big formula in (7), S_x and S_y are the standard deviations in the dependent and independent data sets.

This was needed to calculate the standard error using formula (17), but I ran into some trouble because formula (15) gave a correlation coefficient of larger than |1| for some of the data sets, and when you plug this into (17), you get a negative square root.

So I did a quick search and found an alternative formula from http://www.ditutor.com/regression/correlation_coefficient.html
r=\frac{cov(x,y)}{S_xS_y}
which gave correlation coefficents of less than |1| for all the data sets.

But the thing is, for some of the data sets, both formulas work and the latter gives a much small errors than the former - although both do give very small errors (you can't see them on the graph). For now, I'm using the one which gives the bigger error, but which should I use and why doesn't the correlation coefficient formula in the paper always work?

Also, the trendline calculated using the formula from the paper is also quite poor - there is a very clear linear trend in the data points and the one that I calculated using the formula is a pretty bad fit. Can I just use apply the errors I calculated to my data points and not use the trendline?
I'm not sure I can do that because the paper says that the standard error is in the slope of the regression line, not for the actual data points.

By the way, in case there is a difference between a 'trendline' and a 'regression line' - I mean the same thing in this post as I'm not aware of what the differences are...my apologies to anyone who gets offended by this!Thanks
 
Physics news on Phys.org
Is there an argument based on physics that says the gas level should change linearly with time?

I haven't checked formula 15 for r in the paper, but a correlation coefficient should never be greater than 1.
 
We were measuring changes in CO2 and O2 levels in a closed container containing some plants. I don't know but that's we measured.

Yeah..that's why I had to use the second formula in the OP for some of the data sets but for some, both formulas work. The one in the paper gives a bigger error where it works but I'm not sure if it's correct at all
 
My guess is that you are using one of the formulas incorrectly. Can you find a simple example where the two methods give different values of the correlation coefficient?
 
The two formulas give different correlation coefficients for all the data sets, it's most are bigger than |1| (for the first formula).

Here's one where the correlation coefficient is 'acceptable'.

Standard deviation in x = 4.438234
Standard deviation in y = 211.1072

First formula:

b-value = 38.53639
Correlation coefficient = 0.810174

Second formula:
Covariance = 931.2381
Correlation coefficient = 0.993911


And here's an example where the correlation coefficient using the first formula is too big
Standard deviation in x = 4.438234
Standard deviation in y = 741.0841

First formula:
b-value = -1429.57
Correlation coefficient = -8.56147

Second formula:
Covariance = -2824.48
Correlation coefficient = -0.85874


I'm doing all calculations in Excel so I've attached the file incase you want to have a look but I've gone over the formulas several times and can't find any errors. (the correlation coefficient that is too big is the O2 data, the first one in the file).
 

Attachments

I don't understand how you got b = -1429.5708415884 from the values var(X) = (4.4382335075
)^2 and cov(X,Y) = -2824.4800243686
 
  • Like
Likes   Reactions: 1 person
I was told to use equation (7) (the more convoluted formula for b) but I just did a quick test using b=\frac{Cov(X,Y)}{Var(x)} (which I didn't notice until you mentioend var(X)) and the b-value it gives gives the same correlation coefficient as the second.

The slope of the regression/trendline is also much better fit now, except that the constant a is still quite high the line is translated up and does not lie over the data points. Probably another error in there somewhere I'll have to go through it again.

But thanks a lot!
 
I can't find anymore errors but when finding the averages in the data points, dividing by n+1 rather than n makes the regression line line up with the data really nicely, so that fixes all the problems so far but I don't why it works.
 

Similar threads

  • · Replies 9 ·
Replies
9
Views
2K
  • · Replies 23 ·
Replies
23
Views
4K
  • · Replies 16 ·
Replies
16
Views
3K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 8 ·
Replies
8
Views
3K
  • · Replies 4 ·
Replies
4
Views
6K
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K