Calculating errors and trendline - different correlation coefficients?

In summary: The gas levels in a closed container of plants changed linearly with time. They found that the correlation coefficient between the two formulas was 0.810174. The slope of the regression line was a much better fit to the data points than the line that was calculated using the formula from the paper. They found that the constant a was quite high and the line was translated up.
  • #1
chipotleaway
174
0
Background:
My lab group has taken a number of measurements of gas levels over 15 minute period every 15 seconds. We actually used a gas sensor and computer interface to do it so it pretty much did all the work. In calculating the errors for the data, using the uncertainties given in the user manuals of device, we got unreasonably large errors as the uncertainties on the devices are rather large (20%). So it was suggested that we use linear regression (?) to calculate the trendline for the data and the corresponding errors.

The formulas I'm using to find the trendline and errors are from *, and it involves equations (3) and (7) from the paper.
*http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
These give the respective intercept a and slope b. Then I used equation (15) to find the correlation coefficient:

[tex]\r=b\frac{S_x}{S_y}[/tex]

b is the value found from the big formula in (7), [itex]S_x[/itex] and [itex]S_y[/itex] are the standard deviations in the dependent and independent data sets.

This was needed to calculate the standard error using formula (17), but I ran into some trouble because formula (15) gave a correlation coefficient of larger than |1| for some of the data sets, and when you plug this into (17), you get a negative square root.

So I did a quick search and found an alternative formula from http://www.ditutor.com/regression/correlation_coefficient.html
[tex]r=\frac{cov(x,y)}{S_xS_y}[/tex]
which gave correlation coefficents of less than |1| for all the data sets.

But the thing is, for some of the data sets, both formulas work and the latter gives a much small errors than the former - although both do give very small errors (you can't see them on the graph). For now, I'm using the one which gives the bigger error, but which should I use and why doesn't the correlation coefficient formula in the paper always work?

Also, the trendline calculated using the formula from the paper is also quite poor - there is a very clear linear trend in the data points and the one that I calculated using the formula is a pretty bad fit. Can I just use apply the errors I calculated to my data points and not use the trendline?
I'm not sure I can do that because the paper says that the standard error is in the slope of the regression line, not for the actual data points.

By the way, in case there is a difference between a 'trendline' and a 'regression line' - I mean the same thing in this post as I'm not aware of what the differences are...my apologies to anyone who gets offended by this!Thanks
 
Physics news on Phys.org
  • #2
Is there an argument based on physics that says the gas level should change linearly with time?

I haven't checked formula 15 for r in the paper, but a correlation coefficient should never be greater than 1.
 
  • #3
We were measuring changes in CO2 and O2 levels in a closed container containing some plants. I don't know but that's we measured.

Yeah..that's why I had to use the second formula in the OP for some of the data sets but for some, both formulas work. The one in the paper gives a bigger error where it works but I'm not sure if it's correct at all
 
  • #4
My guess is that you are using one of the formulas incorrectly. Can you find a simple example where the two methods give different values of the correlation coefficient?
 
  • #5
The two formulas give different correlation coefficients for all the data sets, it's most are bigger than |1| (for the first formula).

Here's one where the correlation coefficient is 'acceptable'.

Standard deviation in x = 4.438234
Standard deviation in y = 211.1072

First formula:

b-value = 38.53639
Correlation coefficient = 0.810174

Second formula:
Covariance = 931.2381
Correlation coefficient = 0.993911


And here's an example where the correlation coefficient using the first formula is too big
Standard deviation in x = 4.438234
Standard deviation in y = 741.0841

First formula:
b-value = -1429.57
Correlation coefficient = -8.56147

Second formula:
Covariance = -2824.48
Correlation coefficient = -0.85874


I'm doing all calculations in Excel so I've attached the file incase you want to have a look but I've gone over the formulas several times and can't find any errors. (the correlation coefficient that is too big is the O2 data, the first one in the file).
 

Attachments

  • data1.xlsx
    35.2 KB · Views: 280
  • #6
I don't understand how you got b = -1429.5708415884 from the values var(X) = (4.4382335075
)^2 and cov(X,Y) = -2824.4800243686
 
  • Like
Likes 1 person
  • #7
I was told to use equation (7) (the more convoluted formula for b) but I just did a quick test using [itex]b=\frac{Cov(X,Y)}{Var(x)}[/itex] (which I didn't notice until you mentioend var(X)) and the b-value it gives gives the same correlation coefficient as the second.

The slope of the regression/trendline is also much better fit now, except that the constant a is still quite high the line is translated up and does not lie over the data points. Probably another error in there somewhere I'll have to go through it again.

But thanks a lot!
 
  • #8
I can't find anymore errors but when finding the averages in the data points, dividing by n+1 rather than n makes the regression line line up with the data really nicely, so that fixes all the problems so far but I don't why it works.
 

1. What are correlation coefficients and why are they important in calculating errors and trendlines?

Correlation coefficients are statistical measures that represent the strength and direction of the relationship between two variables. They are important in calculating errors and trendlines because they help us understand the degree of association between variables and can be used to make predictions and identify patterns.

2. What is the difference between positive and negative correlation coefficients?

A positive correlation coefficient indicates that as one variable increases, the other variable also tends to increase. On the other hand, a negative correlation coefficient suggests that as one variable increases, the other variable tends to decrease. In both cases, the variables are moving in the same direction, but with different strengths.

3. How do you calculate the correlation coefficient?

The most commonly used correlation coefficient is the Pearson's r, which is calculated by dividing the covariance of the two variables by the product of their standard deviations. This value ranges from -1 to 1, where a value of 1 represents a perfect positive correlation, 0 represents no correlation, and -1 represents a perfect negative correlation.

4. How do errors and trendlines affect the correlation coefficient?

Errors can decrease the correlation coefficient, as they introduce noise or random variation in the data. On the other hand, trendlines can increase the correlation coefficient, as they help identify underlying patterns and reduce the effect of outliers. Overall, both errors and trendlines can affect the accuracy and reliability of the correlation coefficient.

5. What is the significance of the correlation coefficient in research and data analysis?

The correlation coefficient is a powerful tool in research and data analysis as it helps us understand the strength and direction of relationships between variables. It allows us to make predictions, identify patterns, and test hypotheses. Additionally, it can be used to compare the strength of relationships between multiple variables and make informed decisions.

Similar threads

  • Set Theory, Logic, Probability, Statistics
Replies
23
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
7
Views
280
  • Set Theory, Logic, Probability, Statistics
Replies
9
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
4
Views
5K
  • Set Theory, Logic, Probability, Statistics
Replies
4
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
9
Views
2K
  • STEM Educators and Teaching
Replies
11
Views
2K
  • Calculus and Beyond Homework Help
Replies
3
Views
811
  • Set Theory, Logic, Probability, Statistics
Replies
9
Views
3K
Back
Top