What is the expected value of Cov(x,y)² when X and Y are independent?

  • Context: Undergrad
  • Thread starter: haruspex
  • Tags: correlation, modelling
SUMMARY

For statistically independent X and Y, the expected value of the squared sample covariance, Cov(X,Y)², is derived to be Var(X)Var(Y)/n, where n is the sample size. This arises from an analysis of linear regression: the expected value of the covariance itself is zero, but the expected value of its square is not. The discussion emphasizes the importance of the underlying distributions, such as N(0,σ), and the implications of sample variance for the expected error terms in regression analysis.

PREREQUISITES
  • Understanding of covariance and its properties in statistics.
  • Familiarity with linear regression analysis and its components.
  • Knowledge of statistical independence and its implications on expected values.
  • Basic comprehension of probability distributions, specifically normal distribution N(0,σ).
NEXT STEPS
  • Study the derivation of covariance in the context of independent random variables.
  • Explore the implications of sample variance on regression analysis outcomes.
  • Learn about significance testing for correlation coefficients using t-tests.
  • Investigate the relationship between variance and covariance in statistical modeling.
USEFUL FOR

Statisticians, data analysts, and researchers involved in regression analysis and correlation evaluation will benefit from this discussion, particularly those seeking to understand the mathematical foundations of covariance in independent variable scenarios.

haruspex
All the references I can find on the net to justifying a correlation treat it as a matter of judgment and, quite correctly, as depending on the application.
But it seems to me that one could compare the fit to the data of a horizontal line (i.e. average y) with that of the linear regression and ask whether the improved fit is better than chance.
The best horizontal fit has error term var(y) (population variance, not sample variance), while the best linear fit is less than that by ##\frac{Cov(x,y)^2}{Var(x)}##.
Now suppose, in reality, X and Y are independent. My question is, if we make some guess about the y distribution, what would be the expected value of ##Cov(x,y)^2##?

I tried simplifying by assuming an N(0,1) distribution for Y, and that the mean and variance of the sample match the population, but I get that the expected value of ##Cov(x,y)^2## is Var(x)Var(y). This makes no sense to me because it would lead to the expected error term of the linear fit being zero.

Does anyone have a reference for such an analysis? If not, I'll post my algebra and ask where I am going wrong.
I am aware that there is a whole branch of stats that deals with justifying the number of tuning parameters in a mathematical model, but I was looking for something simpler to start with.
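
A quick numerical check of the identity relied on above, as a minimal sketch assuming numpy and a toy dataset: with 1/n ("population") normalisation throughout, the mean squared residual of the least-squares line equals Var(y) - Cov(x,y)²/Var(x).

Code (Python):
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = rng.normal(size=n)                        # independent of x for this check

var_x, var_y = np.var(x), np.var(y)           # np.var uses 1/n normalisation by default
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

b = cov_xy / var_x                            # least-squares slope
a = y.mean() - b * x.mean()                   # least-squares intercept
mse_linear = np.mean((y - (a + b * x)) ** 2)

print(mse_linear, var_y - cov_xy**2 / var_x)  # the two numbers agree (up to floating point)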
 
I'm guessing this youtube video might interest you (start at 4:00):

(regarding Principal Component Analysis)
 
1plus1is10 said:
I'm guessing this youtube video might interest you (start at 4:00):

(regarding Principal Component Analysis)

It's certainly interesting, thanks, but I don't see the relevance to my question.
 
Would you explain what you mean by "justifying" a correlation? There are standard algorithms to determine how much luck would be needed for a data correlation to reach that value or larger. If that is what you mean by "justify", then it is a statistical question and it can be answered within statistics. On the other hand, if "justify" means determining a cause-and-effect relationship, that is strictly the province of the particular subject matter and is not a statistical issue.
 
FactChecker said:
standard algorithms to determine how much luck would be needed for a data correlation to reach that value or larger
That's what I'm looking for.
 
Then a better word than "justify" would probably be "evaluate". As you suggest, you can use regression analysis to determine whether the data correlation is just a matter of luck or if that amount of correlation would take so much luck that a sceptical person would have to admit that there must be something behind the correlation. A linear regression, ## Y = b + a*X##, would only have a statistically significant value of ##a## if the ##a*X## term improved the prediction of ##Y## a statistically significant amount. A forward stepwise regression would not include that term at all unless it was statistically significant. One can control what level of "statistical significance" is required.

But one must be careful. Two completely unrelated things can be strongly correlated simply because they have a similar trend. For instance, the population of Ireland may be strongly correlated with the distance of the spacecraft OSIRIS-REx from Earth simply because both are increasing with time. So one should have a subject-matter understanding of the situation before drawing any inferences from a correlation.
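
A minimal sketch of that kind of significance check, assuming scipy and illustrative toy data: linregress reports a p-value for the null hypothesis that the slope is zero, so with independent data the ##a*X## term would usually not be judged significant.

Code (Python):
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = rng.normal(size=100)            # independent of x, so the slope should not be significant

result = stats.linregress(x, y)
print(result.slope, result.pvalue)  # the p-value is usually well above 0.05 here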
 
FactChecker said:
Then a better word than "justify" would probably be "evaluate". As you suggest, you can use regression analysis to determine whether the data correlation is just a matter of luck or if that amount of correlation would take so much luck that a sceptical person would have to admit that there must be something behind the correlation. A linear regression, ## Y = b + a*X##, would only have a statistically significant value of ##a## if the ##a*X## term improved the prediction of ##Y## a statistically significant amount. A forward stepwise regression would not include that term at all unless it was statistically significant. One can control what level of "statistical significance" is required.

But one must be careful. Two completely unrelated things can be strongly correlated simply because they have a similar trend. For instance, the population of Ireland may be strongly correlated with the distance of the spacecraft OSIRIS-REx from Earth simply because both are increasing with time. So one should have a subject-matter understanding of the situation before drawing any inferences from a correlation.
Yes, I understand all that, but I am asking what should be a deterministic mathematical question.
Given that the variables are, in fact, independent, and, if necessary, given the actual distribution of Y (N(0,1), say, or uniform if that is easier), what is the expected value of ##(cov(x,y))^2##?
 
haruspex said:
I am asking what should be a deterministic mathematical question
To focus on Math...
Linear Regression has a slope of: cov(x,y)/var(x)
Correlation is: cov(x,y)/sqrt(var(x)*var(y))
So why are you calculating ##cov(x,y)^2##?
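
(As a quick numerical aside, both formulas above can be checked directly; a minimal sketch assuming numpy and toy data:)

Code (Python):
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
slope = cov_xy / np.var(x)                        # cov(x,y)/var(x)
corr = cov_xy / np.sqrt(np.var(x) * np.var(y))    # cov(x,y)/sqrt(var(x)*var(y))

print(slope, np.polyfit(x, y, 1)[0])   # matches the least-squares slope
print(corr, np.corrcoef(x, y)[0, 1])   # matches the Pearson correlation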
 
1plus1is10 said:
To focus on Math...
Linear Regression has a slope of: cov(x,y)/var(x)
Correlation is: cov(x,y)/sqrt(var(x)*var(y))
So why are you calculating ##cov(x,y)^2##?
I thought I explained it in post #1.
The best horizontal fit through the data has (mean of squares of residual) error term var(y).
The best linear fit reduces that error by ##\frac{Cov^2(x,y)}{Var(x)}##.
I am interested in how much reduction should be expected by chance.
 
  • #10
The only thing I can think of is that if/when one uses controlled experiments, double-blinding, etc., and observes the correlation, this increases the (informal) odds that the correlation is not just chance.
 
  • #11
haruspex said:
Yes, I understand all that, but I am asking what should be a deterministic mathematical question.
Given that the variables are, in fact, independent, and, if necessary, given the actual distribution of Y (N(0,1), say, or uniform if that is easier), what is the expected value of ##(cov(x,y))^2##?
If the variables are statistically independent, the expected value of the covariance is 0.
 
  • #12
FactChecker said:
If the variables are statistically independent, the expected value of the covariance is 0.
Yes, but the expected value of the square of the covariance is the question.
 
  • #13
haruspex said:
If not, I'll post my algebra and ask where I am going wrong.
I am aware that there is a whole branch of stats that deals with justifying the number of tuning parameters in a mathematical model, but I was looking for something simpler to start with.

I'd suggest posting the algebra given the responses you've gotten from others. On my end, I'm still not sure what the below means

haruspex said:
Now suppose, in reality, X and Y are independent. My question is, if we make some guess about the y distribution, what would be the expected value of ##Cov(x,y)^2##?

since covariance itself involves an expectation. (I do have a hunch that I know what you're computing, but still putting down the math to eliminate guesswork seems prudent.)
 
  • #14
StoneTemplePython said:
since covariance itself involves an expectation
Yes, I think this is what makes it hard to explain.
The covariance here is defined in terms of a given dataset. If I generate many such datasets with the same x values but different y values then each produces a different covariance. The average covariance will tend to zero, but the average of their squares will not.
Writing the algebra on this iPad is painful, so I'll do that when I get onto a laptop.
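
A minimal Monte Carlo sketch of that idea, assuming a fixed set of x values and y redrawn each time as an independent N(0,1) sample:

Code (Python):
import numpy as np

rng = np.random.default_rng(3)
n, trials = 20, 100_000
x = rng.normal(size=n)              # the same x values for every dataset
var_x = np.var(x)

covs = np.empty(trials)
for t in range(trials):
    y = rng.normal(size=n)          # fresh independent N(0,1) sample
    covs[t] = np.mean((x - x.mean()) * (y - y.mean()))

print(covs.mean())                  # close to 0
print((covs**2).mean(), var_x / n)  # mean of Cov^2 is close to Var(x)*Var(y)/n, with Var(y)=1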
 
  • #15
haruspex said:
Yes, but the expected value of the square of the covariance is the question.
Oh. Sorry. I don't think that I can help on that.
 
  • #16
haruspex said:
I'll post my algebra
Well, thanks for making me write this out properly, @FactChecker. As so often happens, it has pinpointed my blunder.

For simplicity, say Y is N(0,σ) and that in the samples ##\bar x=\bar y=0## and ##Var(y_i)=\sigma^2##.
These constraints should not completely invalidate the result, just distort it somewhat. Specifically, the expected value of ##Cov^2## should still be nonzero, but not so great that the variance of the residuals in the linear regression is zero (which would imply all data points lying on a straight line).
##n^2Cov^2(x,y)=(\Sigma_ix_iy_i)^2##
##n^2E(Cov^2(x,y))=\int_{y_1}..\int_{y_n}(\Sigma_i x_i y_i)^2\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
##=\int_{y_1}..\int_{y_n}(\Sigma_{j>i} 2x_i y_ix_jy_j+\Sigma_i x_i^2 y_i^2)\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
I'll split this into two sets of terms. A term in the first set looks like, for some j>i:
##\int_{y_1}..\int_{y_n}2x_i y_ix_jy_j\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
Performing the integral wrt some ##y_k##, where k is neither i nor j, just produces a factor 1.
Performing the integral wrt ##y_i##:
##\int_{y_i}2x_i y_ix_jy_j\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_i^2}{2\sigma^2}}.dy_i=2x_i \mu_yx_jy_j=0##
So all terms in the first set disappear.
This leaves terms like
##\int_{y_1}..\int_{y_n}x_i^2 y_i^2\Pi_k (\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_k^2}{2\sigma^2}}.dy_k)##
Again, for k≠i, the integration has no effect. That leaves us with terms
##\int_{y_i}x_i^2 y_i^2\frac 1{\sigma\sqrt{2\pi}}e^{-\frac{y_i^2}{2\sigma^2}}.dy_i##
##=x_i^2\sigma^2##
Reinstating the summation:
##n^2E(Cov^2(x,y))=\Sigma_ix_i^2\sigma^2=\sigma^2\Sigma_ix_i^2##
##nE(Cov^2(x,y))=\sigma^2Var(x)=Var(x)Var(y)##
(In my scrawled notes I had been omitting the ##n^2## factor, so did not notice that only one of them cancels. There is an ##n## remaining.)

Hence the expected reduction in the error term is ##\frac{E(Cov^2(x,y))}{Var(x)}= \frac 1nVar(y)##.
To recap, the error term (i.e. the mean of the squared residuals) for the horizontal-line fit is Var(y), and its expected value for an unconstrained linear fit is ##\frac{n-1}nVar(y)##.
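
(As a side check, the two one-dimensional Gaussian integrals the derivation relies on, the vanishing cross term and the ##\sigma^2## diagonal term, can be confirmed symbolically; a minimal sketch assuming sympy:)

Code (Python):
import sympy as sp

y = sp.Symbol('y', real=True)
sigma = sp.Symbol('sigma', positive=True)
density = sp.exp(-y**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))

print(sp.integrate(y * density, (y, -sp.oo, sp.oo)))                  # 0, so the cross terms vanish
print(sp.simplify(sp.integrate(y**2 * density, (y, -sp.oo, sp.oo))))  # sigma**2, the diagonal term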

I have a nasty feeling I have simply rediscovered, in another guise, the well-known relationship between the variance of the sample and the least biased estimate of the variance of the population!
 
  • #17
There is a standard significance test for the null hypothesis ##\rho = 0##. With N the sample size, it gives a standard t-value, ##t = \frac{r}{\sqrt{(1-r^2)/(N-2)}}##, with N-2 degrees of freedom.

http://vassarstats.net/rsig.html
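
A minimal sketch of that test, assuming scipy and toy data: the t-value from the formula above is compared against a t distribution with N-2 degrees of freedom, and the resulting p-value agrees with what scipy's pearsonr reports.

Code (Python):
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N = 40
x = rng.normal(size=N)
y = rng.normal(size=N)                  # independent, so r should not be significant

r = np.corrcoef(x, y)[0, 1]
t = r / np.sqrt((1 - r**2) / (N - 2))
p = 2 * stats.t.sf(abs(t), df=N - 2)    # two-sided p-value

print(r, t, p)
print(stats.pearsonr(x, y))             # same r and (to rounding) the same p-value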
 
  • #18
BWV said:
There is a standard significance test for the null hypothesis ##\rho = 0##. With N the sample size, it gives a standard t-value, ##t = \frac{r}{\sqrt{(1-r^2)/(N-2)}}##, with N-2 degrees of freedom.

http://vassarstats.net/rsig.html
Thanks.
 
