Creating a confounding variable

  • Context: Graduate
  • Thread starter: FallenApple
  • Tags: Variable
SUMMARY

This discussion focuses on the impact of confounding variables in linear regression analysis, specifically using the variables Y, X1, and X2. The user generated Y and X1 from a multivariate normal distribution and manipulated X2 to be similar to X1. Three linear regressions were performed: lm(Y~X1) showed X1 as statistically significant, lm(Y~X2) showed X2 as significant, but lm(Y~X1+X2) revealed X1 as significant while X2 was not, indicating that X1 confounds the relationship between X2 and Y. The discussion highlights the importance of understanding how regression algorithms minimize the sum of squares to determine coefficient significance.

PREREQUISITES
  • Understanding of linear regression analysis
  • Familiarity with multivariate normal distribution
  • Knowledge of statistical significance and confidence intervals
  • Experience with statistical software for regression modeling
NEXT STEPS
  • Research the concept of confounding variables in statistical modeling
  • Learn about multicollinearity and its effects on regression analysis
  • Explore the use of R for performing linear regression and interpreting results
  • Study the implications of coefficient estimation and hypothesis testing in regression
USEFUL FOR

Statisticians, data analysts, and researchers involved in regression analysis, particularly those interested in understanding the effects of confounding variables on statistical significance.

FallenApple
So I have Y, the response, and two predictors X1 and X2. I generate Y and X1 from a multivariate normal distribution. Then I manually set X2 to be nearly the same as X1 (identical except that I change a few entries to make X2 distinct from X1).

I ran three separate linear regressions.

lm(Y~X1) -> X1 statistically significant

lm(Y~X2) -> X2 statistically significant

lm(Y~X1+X2) -> X1 statistically significant and X2 not statistically significant.

I suppose this makes sense. X1 clearly confounds the relation between X2 and Y, since X1 is causally related to both X2 and Y. But I'm not so clear on what is mathematically going on. How does the algorithm detect this? Does it have something to do with holding X1 constant while interpreting X2?
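The experiment above was run in R with lm(); a rough re-creation in Python/numpy shows the same pattern. The seed, sample size, correlation, and number of perturbed entries here are my own illustrative choices, not values from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Draw (Y, X1) jointly normal with correlation 0.8 (illustrative values)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
Y, X1 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# X2 is a near-copy of X1: perturb a handful of entries so the two differ
X2 = X1.copy()
X2[:20] += rng.normal(size=20)

def ols_t(y, *cols):
    """Fit y ~ 1 + cols by least squares; return t-statistics of the slopes."""
    X = np.column_stack([np.ones_like(y), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta[1:] / se[1:]

print(ols_t(Y, X1))      # X1 alone: large |t|, significant
print(ols_t(Y, X2))      # X2 alone: also significant (it tracks X1)
print(ols_t(Y, X1, X2))  # together: X2's |t| collapses relative to X1's
```

Run separately, each predictor looks significant; fit jointly, X2 adds almost nothing beyond X1, so its t-statistic shrinks.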
 
The algorithm selects coefficients c1 and c2 and intercept c0 so as to minimise the sum of squares of (Y - (c0 + c1 X1 + c2 X2)).
Because X1 fits Y better than X2 does, the minimisation assigns a high absolute-value coefficient to X1 and one close to zero to X2. The estimate of c2 then falls inside the acceptance region of the null hypothesis that the true coefficient is zero, meaning that X2 is not statistically significant.
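The "holding X1 constant" intuition can be made concrete: in a least-squares fit, the coefficient on X2 in the joint model equals the slope from regressing the part of Y not explained by X1 on the part of X2 not explained by X1 (the Frisch-Waugh-Lovell theorem). A minimal numpy sketch, with illustrative data of my own:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # near-duplicate of x1
y = 2.0 * x1 + rng.normal(size=n)         # only x1 actually drives y

def fit(y, X):
    """Least-squares coefficients for y ~ 1 + X."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Coefficient on x2 in the full regression y ~ 1 + x1 + x2
b_full = fit(y, np.column_stack([x1, x2]))[2]

# Partialling out: residualize y and x2 on x1, then regress
# residual on residual -- this recovers the same coefficient.
ry = y - np.column_stack([np.ones(n), x1]) @ fit(y, x1)
rx2 = x2 - np.column_stack([np.ones(n), x1]) @ fit(x2, x1)
b_fwl = fit(ry, rx2)[1]

print(np.isclose(b_full, b_fwl))  # True: the two coefficients agree
```

So the coefficient on X2 is estimated from whatever variation in X2 remains after X1 is accounted for; when X2 is a near-copy of X1, almost no such variation exists, and the estimate is correspondingly noisy and insignificant.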
 
Last edited:
  • Like
Likes   Reactions: FallenApple and ZeGato
