# When is it inappropriate to report Pearson correlation coefficients?

1. Aug 29, 2010

### Bacat

I am writing a paper publishing scientific data. My background is chemistry but I have taken a couple of stat classes. In my opinion, one of the biggest deficiencies in modern research publications is the improper use of statistics. I hope to avoid making this mistake, but I need your help.

Most of my data has logarithmic relationships (i.e., one set of observations, say mass, plotted against another set of observations, say temperature, is best fit by a logarithmic regression curve). If I compute Pearson correlation coefficients for these, I get values around 0.80. However, when I plot the data I can clearly see that it is not linear; it is logarithmic. If I fit regression lines to the data, I get better R^2 values with a logarithmic fit than with a linear fit. Reporting the Pearson number could be misleading because it measures the strength of a linear relationship, but I am trying to show that there is a high correlation without claiming that the data is linear.

Can someone with a firm background in statistics theory answer the following questions?

1) Is it appropriate to publish Pearson correlation coefficients for logarithmic data if it is emphasized that the relationship is logarithmic?

2) Does the Pearson number have meaning in the context of logarithmic relationships?

3) Is there a better correlation metric for logarithmic data? Please note that some of the data points are zero. This makes transforming to a logarithmic basis difficult...but maybe there is a trick I'm missing?

Your help is very much appreciated.

2. Aug 30, 2010

### SW VandeCarr

It depends on whether you are plotting original data or transformed data. If the plot of the original data is nonlinear, you would generally want to transform one or both variables to achieve approximate linearity before calculating Pearson's R.

You can get an R value for any data, but it's best interpreted when there is an approximately linear relation between X1 and X2. When these stats are reported, it's always in terms of the specific transformation(s) used.
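To make the point concrete, here is a minimal sketch using NumPy. The data is synthetic and purely hypothetical (a made-up `y = 2*log10(x)` relation with a little noise); it is not the data from the original question. It just compares Pearson's R before and after log-transforming `x`.

```python
import numpy as np

# Hypothetical synthetic data: y depends on log10(x), plus a little noise.
rng = np.random.default_rng(0)
x = np.arange(1, 51, dtype=float)  # strictly positive, so log10(x) is defined
y = 2.0 * np.log10(x) + rng.normal(0.0, 0.05, size=x.size)

# Pearson's R on the original (untransformed) scale.
r_raw = np.corrcoef(x, y)[0, 1]

# Pearson's R after log-transforming x only; y is left untouched.
r_log = np.corrcoef(np.log10(x), y)[0, 1]

print(f"R (x vs y):        {r_raw:.3f}")
print(f"R (log10(x) vs y): {r_log:.3f}")
```

With data like this, the R for the transformed pair should come out noticeably closer to 1 than the R for the raw pair, and the reported figure would be labeled as R for log10(x) vs. y.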

If there are zero values in the variable that is already on a linear scale, leave that variable alone and log-transform the other variable (the one whose values behave like antilogs), and report it as such.

For example, if the antilogs are 1, 10, 100, 1000, the base 10 log transform gives 0, 1, 2, 3. The choice of base only rescales the transformed values by a constant factor, so it does not change R.
This would be reported as log-linear transformed data for the Pearson R.

Last edited: Aug 30, 2010

3. Aug 30, 2010

### Bacat

Thanks for the response!

I'm not familiar with the antilog transform. I think I should make the following transformation:

Data -> Transformed Data

0 -> 1
1 -> 10
2 -> 100
etc.

Is this correct?

Won't this lead to some enormous numbers later? For example:
17 -> 100,000,000,000,000,000

I think I must be doing it wrong...

4. Aug 30, 2010

### SW VandeCarr

Sorry. I wasn't sure what form your original data was in. For example, was X1 already in log form and still non-linear?

Suppose X1 is 1, 10, 100, 1000 and X2 is 0, 1, 2, 3

The best thing is to log-transform X1 to 0, 1, 2, 3 and then correlate the transformed X1 with X2 (a log-linear transform). Because X2 is left untransformed, the 0 in the X2 data is no longer a problem. In this example, of course, R=1.

The antilog transform would instead convert X2 to its antilogs (1, 10, 100, 1000). The main problem is that this is not conveniently presented on a graph with a single linear scale. However, R can still be calculated, with the same result (R=1). Obviously, you can use powers of 10. You will get a linear graph if both axes use the same scaling.
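As a quick check of the toy example above, the following sketch (assuming NumPy is available) computes R for the raw pair, for log10(X1) against X2, and for X1 against the antilogs of X2. The variable names are just illustrative.

```python
import numpy as np

# Toy values from the example above.
x1 = np.array([1.0, 10.0, 100.0, 1000.0])
x2 = np.array([0.0, 1.0, 2.0, 3.0])

# Untransformed: the relation is not linear, so R falls short of 1.
r_raw = np.corrcoef(x1, x2)[0, 1]

# Log-linear: transform X1 to 0, 1, 2, 3 and leave X2 (which contains a 0) alone.
r_log = np.corrcoef(np.log10(x1), x2)[0, 1]

# Antilog: transform X2 to 1, 10, 100, 1000 and leave X1 alone.
r_antilog = np.corrcoef(x1, 10.0 ** x2)[0, 1]

print(f"raw: {r_raw:.3f}  log: {r_log:.3f}  antilog: {r_antilog:.3f}")
```

Both transformed versions give R=1 (up to floating-point rounding), while the untransformed pair gives roughly 0.82. Note that only the log-linear version avoids taking the log of a zero value.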

If both data sets were nonlinearly increasing or decreasing, a log-log transform with an appropriately chosen base might be tried.

Last edited: Aug 31, 2010