What is the expected value of Cov(x,y)² in an independent X and Y scenario?

In summary: the best horizontal fit has error term Var(y) (population variance, not sample variance), while the best linear fit reduces that error by ##\frac{Cov^2(x,y)}{Var(x)}##.
  • #1
haruspex
All the references I can find on the net to justifying a correlation treat it as a matter of judgment and, quite correctly, as something that depends on the application.
But it seems to me that one could compare the fit to the data of a horizontal line (i.e. average y) with that of the linear regression and ask whether the improved fit is better than chance.
The best horizontal fit has error term var(y) (population variance, not sample variance), while the best linear fit is less than that by ##\frac{Cov(x,y)^2}{Var(x)}##.
Now suppose, in reality, X and Y are independent. My question is, if we make some guess about the y distribution, what would be the expected value of ##Cov(x,y)^2##?

I tried simplifying by assuming an N(0,1) distribution for Y, and that the mean and variance of the sample match the population, but I get that the expected value of ##Cov(x,y)^2## is Var(x)Var(y). This makes no sense to me because it would lead to the expected error term of the linear fit being zero.

Does anyone have a reference for such an analysis? If not, I'll post my algebra and ask where I am going wrong.
I am aware that there is a whole branch of stats that deals with justifying the number of tuning parameters in a mathematical model, but I was looking for something simpler to start with.
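For concreteness, here is a minimal Monte Carlo sketch of the setup (not from the post; the seed, sample size, and N(0,1) choices are arbitrary): hold a set of x values fixed, repeatedly draw independent y samples, and average the squared sample covariance, then compare the result against both Var(x)Var(y) and Var(x)Var(y)/n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                              # arbitrary sample size
x = rng.normal(size=n)              # x values, held fixed for every dataset
trials = 100_000

cov_sq = np.empty(trials)
for t in range(trials):
    y = rng.normal(size=n)          # fresh independent N(0,1) y sample
    c = np.mean((x - x.mean()) * (y - y.mean()))   # population-style covariance
    cov_sq[t] = c * c

print("average Cov(x,y)^2 over trials:", cov_sq.mean())
print("Var(x)*Var(y)                 :", np.var(x) * 1.0)
print("Var(x)*Var(y)/n               :", np.var(x) * 1.0 / n)
```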
 
  • #2
I'm guessing this YouTube video might interest you (start at 4:00):

(regarding Principal Component Analysis)
 
  • #3
1plus1is10 said:
I'm guessing this YouTube video might interest you (start at 4:00):

(regarding Principal Component Analysis)

It's certainly interesting, thanks, but I don't see the relevance to my question.
 
  • #4
Would you explain what you mean by "justify" a correlation? There are standard algorithms to determine how much luck a certain data correlation would need to have that value or larger. If that is what you mean by "justify", then it is a statistical question and it can be answered within statistics. On the other hand, if "justify" means determining a cause-and-effect relationship, that is strictly in the province of the particular subject matter and is not a statistical issue.
 
  • #5
FactChecker said:
standard algorithms to determine how much luck a certain data correlation would need to have that value or larger
That's what I'm looking for.
 
  • #6
Then a better word than "justify" would probably be "evaluate". As you suggest, you can use regression analysis to determine whether the data correlation is just a matter of luck or if that amount of correlation would take so much luck that a sceptical person would have to admit that there must be something behind the correlation. A linear regression, ## Y = b + a*X##, would only have a statistically significant value of ##a## if the ##a*X## term improved the prediction of ##Y## a statistically significant amount. A forward stepwise regression would not include that term at all unless it was statistically significant. One can control what level of "statistical significance" is required.
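As a hedged illustration of that point (not from the post; the slope 0.3 and sample size are invented), scipy.stats.linregress reports a two-sided p-value for the null hypothesis that the slope a is zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.3 * x + rng.normal(size=50)       # made-up data: true slope 0.3 plus noise

res = stats.linregress(x, y)
# res.pvalue tests the null hypothesis that the slope a is zero
print(f"slope a = {res.slope:.3f}, p-value = {res.pvalue:.4f}")
```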

But one must be careful. Two completely unrelated things can be strongly correlated simply because they have a similar trend. For instance, the population of Ireland may be strongly correlated with the distance of the spacecraft OSIRIS-REx from Earth simply because both are increasing with time. So one should have a subject-matter understanding of the situation before drawing any inferences from a correlation.
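A minimal sketch of that pitfall with invented numbers (the series below are made up, not actual census or mission data): two series that both trend upward come out almost perfectly correlated even though neither drives the other.

```python
import numpy as np

t = np.arange(100)                          # "time" in arbitrary units
population = 4.7e6 + 3.0e4 * t              # invented, steadily rising series
distance = 1.0e6 + 5.0e3 * t                # another invented rising series

r = np.corrcoef(population, distance)[0, 1]
print(f"correlation of two unrelated trending series: r = {r:.3f}")   # essentially 1.0
```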
 
  • #7
FactChecker said:
Then a better word than "justify" would probably be "evaluate". As you suggest, you can use regression analysis to determine whether the data correlation is just a matter of luck or if that amount of correlation would take so much luck that a sceptical person would have to admit that there must be something behind the correlation. A linear regression, ## Y = b + a*X##, would only have a statistically significant value of ##a## if the ##a*X## term improved the prediction of ##Y## a statistically significant amount. A forward stepwise regression would not include that term at all unless it was statistically significant. One can control what level of "statistical significance" is required.

But one must be careful. Two completely unrelated things can be strongly correlated simply because they have a similar trend. For instance, the population of Ireland may be strongly correlated with the distance of the spacecraft OSIRIS-REx from Earth simply because both are increasing with time. So one should have a subject-matter understanding of the situation before drawing any inferences from a correlation.
Yes, I understand all that, but I am asking what should be a deterministic mathematical question.
Given that the variables are, in fact, independent, and, if necessary, given the actual distribution of Y (N(0,1), say, or uniform if that is easier), what is the expected value of ##Cov(x,y)^2##?
 
  • #8
haruspex said:
I am asking what should be a deterministic mathematical question
To focus on Math...
Linear regression has a slope of ##Cov(x,y)/Var(x)##.
Correlation is ##Cov(x,y)/\sqrt{Var(x)Var(y)}##.
So why are you calculating ##Cov(x,y)^2##?
 
  • #9
1plus1is10 said:
To focus on Math...
Linear regression has a slope of ##Cov(x,y)/Var(x)##.
Correlation is ##Cov(x,y)/\sqrt{Var(x)Var(y)}##.
So why are you calculating ##Cov(x,y)^2##?
I thought I explained it in post #1.
The best horizontal fit through the data has error term (mean of squared residuals) Var(y).
The best linear fit reduces that error by ##\frac{Cov^2(x,y)}{Var(x)}##.
I am interested in how much reduction should be expected by chance.
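For what it's worth, here is a quick numerical check of that error-reduction identity (a sketch with arbitrary made-up data; the 0.5 slope and sample size are not from the thread): the mean squared residual of the least-squares line equals Var(y) minus Cov²(x,y)/Var(x).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)            # arbitrary made-up data

var_y = np.var(y)                            # population-style variance
var_x = np.var(x)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

a = cov_xy / var_x                           # least-squares slope
b = y.mean() - a * x.mean()                  # least-squares intercept
resid = y - (a * x + b)

print("horizontal fit, mean squared residual:", var_y)
print("linear fit, mean squared residual    :", np.mean(resid**2))
print("Var(y) - Cov^2(x,y)/Var(x)           :", var_y - cov_xy**2 / var_x)
```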
 
  • #10
The only thing I can think of is that if/when one uses controls, double-blinding, etc., and observes the correlation, this increases the (informal) odds that the correlation is not just by chance.
 
  • #11
haruspex said:
Yes, I understand all that, but I am asking what should be a deterministic mathematical question.
Given that the variables are, in fact, independent, and, if necessary, given the actual distribution of Y (N(0,1), say, or uniform if that is easier), what is the expected value of ##Cov(x,y)^2##?
If the variables are statistically independent, the expected value of the covariance is 0.
 
  • #12
FactChecker said:
If the variables are statistically independent, the expected value of the covariance is 0.
Yes, but the expected value of the square of the covariance is the question.
 
  • #13
haruspex said:
If not, I'll post my algebra and ask where I am going wrong.
I am aware that there is a whole branch of stats that deals with justifying the number of tuning parameters in a mathematical model, but I was looking for something simpler to start with.

I'd suggest posting the algebra given the responses you've gotten from others. On my end, I'm still not sure what the below means

haruspex said:
Now suppose, in reality, X and Y are independent. My question is, if we make some guess about the y distribution, what would be the expected value of ##Cov(x,y)^2##?

since covariance itself involves an expectation. (I do have a hunch that I know what you're computing, but still putting down the math to eliminate guesswork seems prudent.)
 
  • #14
StoneTemplePython said:
since covariance itself involves an expectation
Yes, I think this is what makes it hard to explain.
The covariance here is defined in terms of a given dataset. If I generate many such datasets with the same x values but different y values then each produces a different covariance. The average covariance will tend to zero, but the average of their squares will not.
Writing the algebra on this ipad is painful, so I'll do that when I get onto a laptop.
 
  • #15
haruspex said:
Yes, but the expected value of the square of the covariance is the question.
Oh. Sorry. I don't think that I can help on that.
 
  • #16
haruspex said:
I'll post my algebra
Well, thanks for making me write this out properly, @FactChecker. As so often happens, it has pinpointed my blunder.

For simplicity, say Y is N(0,σ) and that in the samples ##\bar x=\bar y=0## and ##Var(y_i)=\sigma^2##.
This should not completely invalidate the result, just distort it some. Specifically, the expected value of ##Cov^2(x,y)## should still be nonzero, but not so great that the variance of the residuals in the linear regression is zero (which would imply all data points lie on a straight line).
##n^2Cov^2(x,y)=(\Sigma_ix_iy_i)^2##
##n^2E(Cov^2(x,y))=\int_{y_1}..\int_{y_n}(\Sigma_i x_i y_i)^2\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
##=\int_{y_1}..\int_{y_n}(\Sigma_{j>i} 2x_i y_ix_jy_j+\Sigma_i x_i^2 y_i^2)\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
I'll split this into two sets of terms. A term in the first set looks like, for some j>i:
##\int_{y_1}..\int_{y_n}2x_i y_ix_jy_j\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
Performing the integral wrt some ##y_k##, where k is neither i nor j, just produces a factor of 1.
Performing the integral wrt ##y_i##:
##\int_{y_i}2x_i y_ix_jy_j\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_i^2}{2\sigma^2}}.dy_i=2x_i \mu_yx_jy_j=0##
So all terms in the first set disappear.
This leaves terms like
##\int_{y_1}..\int_{y_n}x_i^2 y_i^2\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
Again, for k≠i, the integration has no effect. That leaves us with terms
##\int_{y_i}x_i^2 y_i^2\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_i^2}{2\sigma^2}}.dy_i##
##=x_i^2\sigma^2##
Reinstating the summation:
##n^2E(Cov^2(x,y))=\Sigma_ix_i^2\sigma^2=\sigma^2\Sigma_ix_i^2=n\sigma^2Var(x)##
##nE(Cov^2(x,y))=\sigma^2Var(x)=Var(x)Var(y)##
(In my scrawled notes I had been omitting the ##n^2## factor, so did not notice that only one of them cancels. There is an n remaining.)

Hence the expected reduction in the error term is ##\frac{E(Cov^2(x,y))}{Var(x)}= \frac 1nVar(y)##.
To recap, the error term (i.e. mean of squared residuals) with a horizontal line fit is Var(y), and its expected value for an unconstrained linear fit is ##\frac{n-1}nVar(y)##.

I have a nasty feeling I have simply rediscovered, in another guise, the well-known relationship between the variance of the sample and the least biased estimate of the variance of the population!
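A Monte Carlo check of the corrected result is straightforward (a sketch, not from the thread; it keeps the x values fixed across trials, but, unlike the simplification above, lets the sample moments of y fluctuate, which does not change these two expectations). Here Var(y) means the population variance, set to 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 25, 200_000
sigma2 = 1.0                                 # population Var(y)
x = rng.normal(size=n)                       # x values held fixed across trials
var_x = np.var(x)

cov_sq = np.empty(trials)
for t in range(trials):
    y = rng.normal(size=n)                   # independent N(0,1) y sample
    c = np.mean((x - x.mean()) * (y - y.mean()))
    cov_sq[t] = c**2

print("estimated E(Cov^2)                       :", cov_sq.mean())
print("Var(x)*Var(y)/n                          :", var_x * sigma2 / n)
print("estimated expected reduction Cov^2/Var(x):", cov_sq.mean() / var_x)
print("Var(y)/n                                 :", sigma2 / n)
```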
 
  • #17
There is a standard significance test for the null hypothesis ##\rho=0##, where N is the sample size, which gives a standard t-value: ##t = r/\sqrt{(1-r^2)/(N-2)}##.

http://vassarstats.net/rsig.html
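As a hedged sketch (the sample is invented), that t-statistic can be computed directly and compared against scipy.stats.pearsonr, which returns the correlation together with the corresponding two-sided p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N = 40
x = rng.normal(size=N)
y = rng.normal(size=N)                       # independent of x

r = np.corrcoef(x, y)[0, 1]
t = r / np.sqrt((1 - r**2) / (N - 2))        # the t-value from the post
p = 2 * stats.t.sf(abs(t), df=N - 2)         # two-sided p-value

print(f"r = {r:.3f}, t = {t:.3f}, p = {p:.3f}")
print("scipy check:", stats.pearsonr(x, y))  # should give the same r and p
```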
 
  • #18
BWV said:
There is a standard significance test for the null hypothesis ##\rho=0##, where N is the sample size, which gives a standard t-value: ##t = r/\sqrt{(1-r^2)/(N-2)}##.

http://vassarstats.net/rsig.html
Thanks.
 

What is the expected value of Cov(x,y)² in an independent X and Y scenario?

If X and Y are independent, the expected value of the covariance itself is 0. The expected value of its square, however, is not: for a sample of size n, the derivation above gives ##E(Cov^2(x,y)) = \frac{Var(x)Var(y)}{n}##, which is positive for any finite sample and only shrinks toward 0 as the sample grows.

How is the expected value of Cov(x,y)² calculated?

The sample covariance is computed from one particular dataset, so its square varies from dataset to dataset. Its expected value is the average of that squared covariance over the distribution of possible datasets; in post #16 this is done by integrating ##(\frac 1n\Sigma_i x_i y_i)^2## against the assumed density of the y values.

What does a higher expected value of Cov(x,y)² indicate?

Under independence, a higher expected value of ##Cov^2(x,y)## simply reflects more sampling noise: larger Var(x)Var(y) or a smaller sample. An observed squared covariance well above this chance level is evidence of a genuine relationship between the two variables.

Can the expected value of Cov(x,y)² be negative?

No. The covariance itself can be negative, but its square cannot, so the expected value of ##Cov^2(x,y)## is always non-negative.

How does the expected value of Cov(x,y)² differ from the expected value of Cov(x,y)?

For independent variables the expected value of Cov(x,y) is 0, because positive and negative sample covariances cancel on average. The expected value of ##Cov^2(x,y)## instead measures how far the sample covariance typically strays from 0, and is positive for any finite sample.
