Creating a confounding variable

Summary: Y and the predictor X1 are generated from a multivariate normal distribution, and X2 is created as a near-copy of X1 with a few entries altered. Three linear regressions are run: Y on X1 is significant, Y on X2 is significant, but with both predictors included only X1 remains significant, because X1 confounds the relationship between X2 and Y through its causal links to both. The least-squares fit assigns a large coefficient to X1 and a small one to X2, so X2 loses statistical significance.
FallenApple
So I have Y, the response, and two predictors X1 and X2. I generate Y and X1 from a multivariate normal distribution, then manually set X2 to be nearly the same as X1 (identical except that I change a few entries to make X2 distinct from X1).
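For concreteness, here is a minimal R sketch of that setup. The sample size, the 0.8 correlation, and the five perturbed entries are illustrative assumptions rather than values from the post, and MASS::mvrnorm is one convenient way to draw from a multivariate normal:

```r
library(MASS)  # provides mvrnorm

set.seed(42)
n <- 200

# Draw (Y, X1) jointly from a bivariate normal with correlation 0.8
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), nrow = 2)
dat <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
Y  <- dat[, 1]
X1 <- dat[, 2]

# X2 is a near-copy of X1: perturb a few entries so the two columns differ
X2 <- X1
idx <- sample(n, 5)
X2[idx] <- X2[idx] + rnorm(5)
```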

I ran three separate linear regressions.

lm(Y ~ X1) -> X1 statistically significant

lm(Y ~ X2) -> X2 statistically significant

lm(Y ~ X1 + X2) -> X1 statistically significant, X2 not statistically significant.
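Continuing the sketch above (same assumed variable names), the three fits can be inspected with summary(), whose coefficient table carries the p-values:

```r
summary(lm(Y ~ X1))       # X1: small p-value
summary(lm(Y ~ X2))       # X2: small p-value (it nearly duplicates X1)
summary(lm(Y ~ X1 + X2))  # X1 stays significant; X2's p-value is large
```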

I suppose this makes sense. X1 clearly confounds the relation between X2 and Y, since X1 is causally related to both X2 and Y. But I'm not so clear on what is going on mathematically. How does the algorithm detect this? Does it have something to do with holding X1 constant while interpreting X2?
 
The algorithm selects the intercept c0 and coefficients c1 and c2 so as to minimise the residual sum of squares ##\sum_i \big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2##.
Because the fit between X1 and Y is better than between X2 and Y, it will choose a coefficient of large absolute value for X1 and a small one for X2. The estimate of c2 then lies close enough to zero, relative to its standard error, that the null hypothesis that the true coefficient is zero cannot be rejected; equivalently, the confidence interval for c2 contains zero. That is why X2 shows up as not statistically significant.
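To make the mechanics concrete, here is a sketch using the same assumed data as above. lm solves the least-squares normal equations, and each coefficient's standard error comes from the diagonal of ##\hat\sigma^2 (X^T X)^{-1}##, which is inflated when columns of the design matrix are nearly collinear:

```r
# Design matrix with an intercept column; solve the normal equations by hand
X <- cbind(1, X1, X2)
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% Y))  # matches coef(lm(Y ~ X1 + X2))

# Residual variance and coefficient standard errors
res <- Y - drop(X %*% beta_hat)
s2  <- sum(res^2) / (length(Y) - ncol(X))
se  <- sqrt(diag(s2 * solve(t(X) %*% X)))        # matches lm's "Std. Error" column

# Because X1 and X2 are nearly identical columns, t(X) %*% X is close to
# singular, which inflates the standard errors of c1 and c2; c2's estimate
# ends up small relative to its standard error, hence "not significant".
cbind(estimate = beta_hat, std.error = se)
```

The t statistic that summary() reports is just estimate divided by standard error, so a small c2 estimate paired with a collinearity-inflated standard error falls well short of significance.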
 