What is the expected value of Cov(x,y)² when X and Y are independent?

  • Context: Undergrad
  • Thread starter: haruspex
  • Tags: correlation, modelling
SUMMARY

For statistically independent X and Y, the expected value of the squared sample covariance, Cov(X,Y)², is derived to be Var(X)Var(Y)/n, where n is the sample size. This arises from an analysis of linear regression: the expected value of the covariance itself is zero, but the expected value of its square is not. The discussion emphasizes the importance of the underlying distributions, such as N(0,σ), and the implications of sample variance for the expected error terms in regression analysis.

PREREQUISITES
  • Understanding of covariance and its properties in statistics.
  • Familiarity with linear regression analysis and its components.
  • Knowledge of statistical independence and its implications on expected values.
  • Basic comprehension of probability distributions, specifically normal distribution N(0,σ).
NEXT STEPS
  • Study the derivation of covariance in the context of independent random variables.
  • Explore the implications of sample variance on regression analysis outcomes.
  • Learn about significance testing for correlation coefficients using t-tests.
  • Investigate the relationship between variance and covariance in statistical modeling.
USEFUL FOR

Statisticians, data analysts, and researchers involved in regression analysis and correlation evaluation will benefit from this discussion, particularly those seeking to understand the mathematical foundations of covariance in independent variable scenarios.

haruspex
All the references I can find on the net to justifying a correlation treat it as a matter of judgment and, quite correctly, as depending on the application.
But it seems to me that one could compare the fit to the data of a horizontal line (i.e. average y) with that of the linear regression and ask whether the improved fit is better than chance.
The best horizontal fit has error term var(y) (population variance, not sample variance), while the best linear fit is less than that by ##\frac{Cov(x,y)^2}{Var(x)}##.
Now suppose, in reality, X and Y are independent. My question is, if we make some guess about the y distribution, what would be the expected value of ##Cov(x,y)^2##?

I tried simplifying by assuming an N(0,1) distribution for Y, and that the mean and variance of the sample match the population, but I get that the expected value of ##Cov(x,y)^2## is Var(x)Var(y). This makes no sense to me because it would lead to the expected error term of the linear fit being zero.

Does anyone have a reference for such an analysis? If not, I'll post my algebra and ask where I am going wrong.
I am aware that there is a whole branch of stats that deals with justifying the number of tuning parameters in a mathematical model, but I was looking for something simpler to start with.
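
A quick numerical check of the identity relied on above, as a minimal sketch assuming numpy and a toy dataset: with 1/n ("population") normalisation throughout, the mean squared residual of the least-squares line equals Var(y) - Cov(x,y)²/Var(x).

Code (Python):
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = rng.normal(size=n)                        # independent of x for this check

var_x, var_y = np.var(x), np.var(y)           # np.var uses 1/n normalisation by default
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

b = cov_xy / var_x                            # least-squares slope
a = y.mean() - b * x.mean()                   # least-squares intercept
mse_linear = np.mean((y - (a + b * x)) ** 2)

print(mse_linear, var_y - cov_xy**2 / var_x)  # the two numbers agree (up to floating point)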
 
I'm guessing this youtube video might interest you (start at 4:00):

(regarding Principal Component Analysis)
 
1plus1is10 said:
I'm guessing this youtube video might interest you (start at 4:00):

(regarding Principal Component Analysis)

It's certainly interesting, thanks, but I don't see the relevance to my question.
 
Would you explain what you mean by "justifying" a correlation? There are standard algorithms to determine how much luck would be needed for a data correlation to reach that value or larger. If that is what you mean by "justify", then it is a statistical question and it can be answered within statistics. On the other hand, if "justify" means determining a cause-and-effect relationship, that is strictly the province of the particular subject matter and is not a statistical issue.
 
FactChecker said:
standard algorithms to determine how much luck would be needed for a data correlation to reach that value or larger
That's what I'm looking for.
 
Then a better word than "justify" would probably be "evaluate". As you suggest, you can use regression analysis to determine whether the data correlation is just a matter of luck or if that amount of correlation would take so much luck that a sceptical person would have to admit that there must be something behind the correlation. A linear regression, ## Y = b + a*X##, would only have a statistically significant value of ##a## if the ##a*X## term improved the prediction of ##Y## a statistically significant amount. A forward stepwise regression would not include that term at all unless it was statistically significant. One can control what level of "statistical significance" is required.

But one must be careful. Two completely unrelated things can be strongly correlated simply because they have a similar trend. For instance, the population of Ireland may be strongly correlated with the distance of the spacecraft OSIRIS-REx from Earth simply because both are increasing with time. So one should have a subject-matter understanding of the situation before drawing any inferences from a correlation.
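
A minimal sketch of that kind of significance check, assuming scipy and illustrative toy data: linregress reports a p-value for the null hypothesis that the slope is zero, so with independent data the ##a*X## term would usually not be judged significant.

Code (Python):
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = rng.normal(size=100)            # independent of x, so the slope should not be significant

result = stats.linregress(x, y)
print(result.slope, result.pvalue)  # the p-value is usually well above 0.05 here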
 
FactChecker said:
Then a better word than "justify" would probably be "evaluate". As you suggest, you can use regression analysis to determine whether the data correlation is just a matter of luck or if that amount of correlation would take so much luck that a sceptical person would have to admit that there must be something behind the correlation. A linear regression, ## Y = b + a*X##, would only have a statistically significant value of ##a## if the ##a*X## term improved the prediction of ##Y## a statistically significant amount. A forward stepwise regression would not include that term at all unless it was statistically significant. One can control what level of "statistical significance" is required.

But one must be careful. Two completely unrelated things can be strongly correlated simply because they have a similar trend. For instance, the population of Ireland may be strongly correlated with the distance of the spacecraft OSIRIS-REx from Earth simply because both are increasing with time. So one should have a subject-matter understanding of the situation before drawing any inferences from a correlation.
Yes, I understand all that, but I am asking what should be a deterministic mathematical question.
Given that the variables are, in fact, independent, and, if necessary, given the actual distribution of Y (N(0,1), say, or uniform if that is easier), what is the expected value of ##(cov(x,y))^2##?
 
haruspex said:
I am asking what should be a deterministic mathematical question
To focus on Math...
Linear Regression has a slope of: cov(x,y)/var(x)
Correlation is: cov(x,y)/sqrt(var(x)*var(y))
So why are you calculating ##cov(x,y)^2##?
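
(As a quick numerical aside, both formulas above can be checked directly; a minimal sketch assuming numpy and toy data:)

Code (Python):
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
slope = cov_xy / np.var(x)                        # cov(x,y)/var(x)
corr = cov_xy / np.sqrt(np.var(x) * np.var(y))    # cov(x,y)/sqrt(var(x)*var(y))

print(slope, np.polyfit(x, y, 1)[0])   # matches the least-squares slope
print(corr, np.corrcoef(x, y)[0, 1])   # matches the Pearson correlation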
 
1plus1is10 said:
To focus on Math...
Linear Regression has a slope of: cov(x,y)/var(x)
Correlation is: cov(x,y)/sqrt(var(x)*var(y))
So why are you calculating ##cov(x,y)^2##?
I thought I explained it in post #1.
The best horizontal fit through the data has (mean of squares of residual) error term var(y).
The best linear fit reduces that error by ##\frac{Cov^2(x,y)}{Var(x)}##.
I am interested in how much reduction should be expected by chance.
 
  • #10
The only thing I can think of is that if/when one uses controlled experiments, double-blinding, etc., and observes the correlation, this increases the (informal) odds that the correlation is not just chance.
 
  • #11
haruspex said:
Yes, I understand all that, but I am asking what should be a deterministic mathematical question.
Given that the variables are, in fact, independent, and, if necessary, given the actual distribution of Y (N(0,1), say, or uniform if that is easier), what is the expected value of ##(cov(x,y))^2##?
If the variables are statistically independent, the expected value of the covariance is 0.
 
  • #12
FactChecker said:
If the variables are statistically independent, the expected value of the covariance is 0.
Yes, but the expected value of the square of the covariance is the question.
 
  • #13
haruspex said:
If not, I'll post my algebra and ask where I am going wrong.
I am aware that there is a whole branch of stats that deals with justifying the number of tuning parameters in a mathematical model, but I was looking for something simpler to start with.

I'd suggest posting the algebra given the responses you've gotten from others. On my end, I'm still not sure what the below means

haruspex said:
Now suppose, in reality, X and Y are independent. My question is, if we make some guess about the y distribution, what would be the expected value of ##Cov(x,y)^2##?

since covariance itself involves an expectation. (I do have a hunch that I know what you're computing, but still putting down the math to eliminate guesswork seems prudent.)
 
  • #14
StoneTemplePython said:
since covariance itself involves an expectation
Yes, I think this is what makes it hard to explain.
The covariance here is defined in terms of a given dataset. If I generate many such datasets with the same x values but different y values then each produces a different covariance. The average covariance will tend to zero, but the average of their squares will not.
Writing the algebra on this iPad is painful, so I'll do that when I get onto a laptop.
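
A minimal Monte Carlo sketch of that idea, assuming a fixed set of x values and y redrawn each time as an independent N(0,1) sample:

Code (Python):
import numpy as np

rng = np.random.default_rng(3)
n, trials = 20, 100_000
x = rng.normal(size=n)              # the same x values for every dataset
var_x = np.var(x)

covs = np.empty(trials)
for t in range(trials):
    y = rng.normal(size=n)          # fresh independent N(0,1) sample
    covs[t] = np.mean((x - x.mean()) * (y - y.mean()))

print(covs.mean())                  # close to 0
print((covs**2).mean(), var_x / n)  # mean of Cov^2 is close to Var(x)*Var(y)/n, with Var(y)=1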
 
  • #15
haruspex said:
Yes, but the expected value of the square of the covariance is the question.
Oh. Sorry. I don't think that I can help on that.
 
  • #16
haruspex said:
I'll post my algebra
Well, thanks for making me write this out properly, @FactChecker. As so often happens, it has pinpointed my blunder.

For simplicity, say Y is N(0,σ) and that in the samples ##\bar x=\bar y=0## and ##Var(y_i)=\sigma^2##.
These constraints should not completely invalidate the result, just distort it somewhat. Specifically, the expected value of ##Cov^2## should still be nonzero, but not so great that the variance of the residuals in the linear regression is zero (which would imply all data points lying on a straight line).
##n^2Cov^2(x,y)=(\Sigma_ix_iy_i)^2##
##n^2E(Cov^2(x,y))=\int_{y_1}..\int_{y_n}(\Sigma_i x_i y_i)^2\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
##=\int_{y_1}..\int_{y_n}(\Sigma_{j>i} 2x_i y_ix_jy_j+\Sigma_i x_i^2 y_i^2)\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
I'll split this into two sets of terms. A term in the first set looks like, for some j>i:
##\int_{y_1}..\int_{y_n}2x_i y_ix_jy_j\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
Performing the integral wrt some ##y_k##, where k is neither i nor j, just produces a factor 1.
Performing the integral wrt ##y_i##:
##\int_{y_i}2x_i y_ix_jy_j\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_i^2}{2\sigma^2}}.dy_i=2x_i \mu_yx_jy_j=0##
So all terms in the first set disappear.
This leaves terms like
##\int_{y_1}..\int_{y_n}x_i^2 y_i^2\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
Again, for k≠i, the integration has no effect. That leaves us with terms
##\int_{y_i}x_i^2 y_i^2\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_i^2}{2\sigma^2}}.dy_i##
##=x_i^2\sigma^2##
Reinstating the summation:
##n^2E(Cov^2(x,y))=\Sigma_ix_i^2\sigma^2=\sigma^2\Sigma_ix_i^2##
##nE(Cov^2(x,y))=\sigma^2Var(x)=Var(x)Var(y)##
(In my scrawled notes I had been omitting the ##n^2## factor, so did not notice that only one of them cancels. There is an ##n## remaining.)

Hence the expected reduction in the error term is ##\frac{E(Cov^2(x,y))}{Var(x)}= \frac 1nVar(y)##.
To recap, the error term (i.e. the mean of the squared residuals) for the horizontal-line fit is Var(y), and its expected value for an unconstrained linear fit is ##\frac{n-1}nVar(y)##.
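
(As a side check, the two one-dimensional Gaussian integrals the derivation relies on, the vanishing cross term and the ##\sigma^2## diagonal term, can be confirmed symbolically; a minimal sketch assuming sympy:)

Code (Python):
import sympy as sp

y = sp.Symbol('y', real=True)
sigma = sp.Symbol('sigma', positive=True)
density = sp.exp(-y**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))

print(sp.integrate(y * density, (y, -sp.oo, sp.oo)))                  # 0, so the cross terms vanish
print(sp.simplify(sp.integrate(y**2 * density, (y, -sp.oo, sp.oo))))  # sigma**2, the diagonal term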

I have a nasty feeling I have simply rediscovered, in another guise, the well-known relationship between the variance of the sample and the least biased estimate of the variance of the population!
 
  • #17
There is a standard significance test for the null hypothesis ##\rho = 0##. With N the sample size, it gives a standard t-value, ##t = \frac{r}{\sqrt{(1-r^2)/(N-2)}}##, with N-2 degrees of freedom.

http://vassarstats.net/rsig.html
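
A minimal sketch of that test, assuming scipy and toy data: the t-value from the formula above is compared against a t distribution with N-2 degrees of freedom, and the resulting p-value agrees with what scipy's pearsonr reports.

Code (Python):
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N = 40
x = rng.normal(size=N)
y = rng.normal(size=N)                  # independent, so r should not be significant

r = np.corrcoef(x, y)[0, 1]
t = r / np.sqrt((1 - r**2) / (N - 2))
p = 2 * stats.t.sf(abs(t), df=N - 2)    # two-sided p-value

print(r, t, p)
print(stats.pearsonr(x, y))             # same r and (to rounding) the same p-value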
 
  • #18
BWV said:
There is a standard significance test for the null hypothesis ##\rho = 0##. With N the sample size, it gives a standard t-value, ##t = \frac{r}{\sqrt{(1-r^2)/(N-2)}}##, with N-2 degrees of freedom.

http://vassarstats.net/rsig.html
Thanks.
 
