Where did the correlation formula come from?

In summary: the formula for correlation is a normalization of the covariance formula. It measures the linear relationship between two variables, taking values between -1 and 1 that indicate the strength and direction of the relationship. For example, for any constant B, cor(x, x+B) = 1 and cor(x, -x+B) = -1. The bound follows from the Cauchy-Bunyakovsky-Schwarz inequality, and the correlation can be interpreted as the cosine of an angle, by analogy with the dot product. Dividing by the standard deviations is what keeps the correlation between -1 and 1. For the CFA exam, it is enough to memorize the formula and understand what computing the correlation means.
  • #1
CuriousBanker
So I am still studying for CFA level one, and it has been years since I took a statistics course.

Anyway, the formula is Correlation=Covariance(x,y)/(standard deviation of x times the standard deviation of y)

It is easy enough to calculate...but where did the formula come from? As in, how was it derived? Also, how do we know it always stays between -1 and +1?

I am just curious, because although variance, standard deviation and covariance are all intuitive to me, and although the concept of correlation is very easy to grasp, for some reason the formula is not making intuitive sense to me.

Thanks in advance.
 
  • #2
If you're ok with covariance, then correlation is not so difficult. It's basically just a normalization of the covariance formula. That is, it's just the covariance, but we made sure that it always stays between -1 and 1.

So 0 should indicate no correlation (so independent properties have no correlation, although the converse is false).
A correlation of 1 indicates a linear relationship between x and y with positive slope. So you know (almost surely) that if x is large then y will be large too.
A correlation of -1 indicates a linear relationship between x and y with negative slope. So you know (almost surely) that if x is large, then y will be small.

The correlation then indicates how close you are to these three situations. So if a correlation is .95, then you're very close to a linear relationship. So if x is large, then you're pretty sure that y is large too.

As for why the correlation is between 1 and -1, I think you should just look up the proof in a probability/statistics book. In more mathy texts, they will just say it follows from the Cauchy-Bunyakovsky-Schwarz inequality though. In fact, if you ever studied dot products in geometry, then you have without a doubt seen the formula

[tex]x\cdot y = \|x\|\|y\|\cos\theta[/tex]

Rearranging gives you

[tex]\cos\theta =\frac{x\cdot y}{\|x\|\|y\|}[/tex]

Now, if you interpret the covariance as a dot product and the standard deviations as the norms, then you can interpret the correlation as the cosine of an angle. Please do ignore if this last part makes no sense to you at all. It's not so important, but I think it's a neat interpretation.
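
For anyone who wants to see the cosine interpretation in numbers, here is a minimal sketch in Python. The function name `correlation_as_cosine` and the data are made up for illustration. It centers each data vector by subtracting its mean, then computes the cosine of the angle between the two centered vectors with the dot-product formula above.

```python
import math

def correlation_as_cosine(xs, ys):
    """Correlation computed as the cosine of the angle between
    the mean-centered data vectors (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Center each vector: subtract its mean from every entry.
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    # Dot product of the centered vectors (n times the covariance).
    dot = sum(a * b for a, b in zip(cx, cy))
    # Norms of the centered vectors (sqrt(n) times each standard deviation).
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    # The factors of n cancel, leaving covariance / (sd_x * sd_y).
    return dot / (norm_x * norm_y)

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
r = correlation_as_cosine(xs, ys)
print(r)  # approximately 0.8
```

Because the result is a cosine, it cannot leave the interval from -1 to 1, which is the geometric version of the Cauchy-Bunyakovsky-Schwarz argument.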
 
  • #3
Micro, thanks for always helping.

I already understood what it all meant...and why close to 1 means strongly linear, etc.

The rest, I don't understand at all. I'm thinking maybe I should just memorize the formula for now, and save the deep understanding for when I take my probability/stats classes in a couple of years. What do you think?

Also, besides the proof that it always stays in between -1 and +1, why are we dividing by the product of the standard deviations? Or is that something else I should just memorize now and understand later?

That's the one thing I hate about these kinds of licenses/exams...to me, this stuff is meaningless without the proofs and explanations...but whatever, if it's what firms want to see, I'll just do it.
 
  • #4
CuriousBanker said:
Micro, thanks for always helping.

I already understood what it all meant...and why close to 1 means strongly linear, etc.

The rest, I don't understand at all. I'm thinking maybe I should just memorize the formula for now, and save the deep understanding for when I take my probability/stats classes in a couple of years. What do you think?

To be honest, I don't really think there is any deep understanding involved here. The only stuff that might be a bit deep is the connection with inner-product space in linear algebra, and that's not even such an important thing. So yes, just memorize the formula, but I think you already understand it fine. I can't blame you for thinking there is some deep stuff going on here that you don't understand at the moment, and often this will be the case in mathematics, but not here.

CuriousBanker said:
Also, besides the proof that it always stays in between -1 and +1, why are we dividing by the product of the standard deviations? Or is that something else I should just memorize now and understand later?

We are dividing precisely so that it stays between -1 and +1. Another reason for me to divide by them is the analogy with the dot product. I don't think there are any other reasons.
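
A short sketch of why this particular divisor does the job, assuming both standard deviations are nonzero (the symbols X*, Y* and ρ are introduced here for illustration and do not appear elsewhere in the thread). Standardize each variable:

[tex]X^* = \frac{X-\mu_X}{\sigma_X},\qquad Y^* = \frac{Y-\mu_Y}{\sigma_Y}[/tex]

Both have variance 1, and their covariance is exactly the correlation:

[tex]\operatorname{Cov}(X^*,Y^*) = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y} = \rho[/tex]

Since a variance can never be negative,

[tex]0 \le \operatorname{Var}(X^* \pm Y^*) = \operatorname{Var}(X^*) + \operatorname{Var}(Y^*) \pm 2\operatorname{Cov}(X^*,Y^*) = 2 \pm 2\rho[/tex]

which gives

[tex]-1 \le \rho \le 1[/tex]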
 
  • #5
When a line is fit to data by linear regression using "least squares" (as opposed to "total least squares"), the method assumes there can be error in the Y measurements, but no error in X. For this reason the regression line of Y as a function of X is usually not the same line as you would get if you did a regression of X as a function of Y (even if you rotate the graph, i.e. the predicted value of Y given an X by one method need not agree with the predicted value of Y using the other). You could view the correlation coefficient as an attempt to say something about the linear relation of X and Y without committing to which has errors in measurement.
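
To make the point about the two regression lines concrete, here is a minimal sketch in Python with made-up data (the variable names are illustrative). It fits Y on X and X on Y by ordinary least squares and compares the resulting lines. It also checks a standard identity: the product of the two regression slopes equals the square of the correlation, so the correlation treats X and Y symmetrically.

```python
import math

# Made-up illustrative data (not from the thread).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [4.0, 2.0, 8.0, 6.0, 10.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products of the centered data.
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares slope when predicting y from x (error assumed in y only).
slope_y_on_x = sxy / sxx
# Least-squares slope when predicting x from y (error assumed in x only).
slope_x_on_y = sxy / syy
# The second line, redrawn on the usual (x, y) axes, has slope 1 / slope_x_on_y.
slope_x_on_y_in_xy_plane = 1.0 / slope_x_on_y

# Correlation: the same value whichever variable is treated as the response.
r = sxy / math.sqrt(sxx * syy)

print(slope_y_on_x, slope_x_on_y_in_xy_plane)   # two different lines
print(r, slope_y_on_x * slope_x_on_y, r * r)    # product of slopes matches r squared
```

With this data the two lines have slopes of roughly 1.6 and 2.5 on the same axes, while the correlation is roughly 0.8 either way.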
 
  • #6
Some basic cases should make you comfortable with the equation. Look at the equation in these examples of completely correlated variables: cor(x,x)=1, cor(x,-x)=-1. With A positive, cor(x,A*x)=1, cor(x,-A*x)=-1, and cor(x, A*x+B)=1. At the other extreme of completely independent variables, x and y, cor(x,y)=0.
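
These basic cases are easy to check numerically. The sketch below is in Python; the helper `pearson`, the data, and the constants A and B are made up for illustration. It also includes one case that is not in the list above: y = x² on data symmetric about zero, where the correlation is 0 even though y is completely determined by x. That is the sense in which zero correlation does not imply independence.

```python
import math

def pearson(xs, ys):
    """Covariance divided by the product of the standard deviations
    (population versions; hypothetical helper for illustration)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

x = [1.0, 2.0, 4.0, 7.0]   # made-up data
A, B = 3.0, 5.0            # A must be positive; B can be any constant

cases = {
    "cor(x, x)":       pearson(x, x),
    "cor(x, -x)":      pearson(x, [-v for v in x]),
    "cor(x, A*x)":     pearson(x, [A * v for v in x]),
    "cor(x, -A*x)":    pearson(x, [-A * v for v in x]),
    "cor(x, A*x + B)": pearson(x, [A * v + B for v in x]),
}
for name, value in cases.items():
    print(name, round(value, 12))

# Dependent but uncorrelated: y = x**2 on data symmetric about zero.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_square = pearson(x_sym, [v * v for v in x_sym])
print("cor(x, x**2) on symmetric x:", r_square)
```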
 

FAQ: Where did the correlation formula come from?

1. What is the correlation formula used for?

The correlation formula is used to measure the strength and direction of the linear relationship between two variables. It helps to determine if there is a positive, negative, or no relationship between the variables.

2. Who developed the correlation formula?

The correlation formula in its modern form was developed by the British statistician and mathematician Karl Pearson in the 1890s, building on earlier ideas of Francis Galton. It is often called the Pearson product-moment correlation coefficient.

3. How is the correlation formula calculated?

The correlation is calculated by dividing the covariance of the two variables by the product of their standard deviations. This results in a value between -1 and 1, where -1 represents a perfect negative linear relationship, 0 represents no linear correlation, and 1 represents a perfect positive linear relationship.

4. Is the correlation formula the only way to measure correlation?

No, there are other ways to measure correlation, such as Spearman's rank correlation coefficient and Kendall's tau. These rank-based (non-parametric) measures are typically used when the relationship between the variables is monotonic but not linear, or when the data are ordinal or far from normally distributed.
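
As a rough illustration of the difference, Spearman's coefficient can be computed as the ordinary correlation formula applied to the ranks of the data. The sketch below is in Python; the helpers `pearson` and `ranks` and the data are made up for illustration, and `ranks` assumes there are no tied values. It uses a relationship that is perfectly monotonic but not linear.

```python
import math

def pearson(xs, ys):
    """Covariance over the product of standard deviations (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

def ranks(values):
    """Rank each value from 1 (smallest) to n (largest).
    Assumes no ties; tied values would need averaged ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

# Made-up data: y grows exponentially with x (monotonic, not linear).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]

pearson_r = pearson(x, y)                     # below 1: the relation is not linear
spearman_rho = pearson(ranks(x), ranks(y))    # 1: the ranks agree perfectly

print(pearson_r, spearman_rho)
```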

5. Can the correlation formula be used for all types of data?

The correlation formula is best suited for continuous numerical data. It can also be used for ordinal data, which is data that can be ranked or put into categories, but it may not provide an accurate measure of the relationship between the variables. It is not recommended to use the correlation formula for categorical or nominal data.
