# Linear regression and random variables

fog37
Hello,
I have a question about linear regression models and correlation. My understanding is that our finite set of data ##(x,y)## represents a random sample from a much larger population. Each pair is an observation in the sample.

We find, using OLS, the best fit line and its coefficients and run some statistical tests (t-test and F-test) to check the coefficients' statistical significance. The ultimate goal is to estimate with precision the population slope and intercept.

Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution? In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random? A random variable has its possible values and associated probabilities. Two random variables ##X## and ##Y## are said to be jointly normal if ##aX+bY## has a normal distribution.

That said, how do we get to the linear model ## y =\beta_1 x +\beta_0## considering ##X## and ##Y## as both random variables?

Thank you!

Homework Helper
Gold Member
Linear regression is applied to the model ##y = \beta_1 x +\beta_0 + \epsilon##, where ##\epsilon## has a Normal distribution with mean 0. The independent ##x## values are not assumed to come from a random variable, but they can be.

• fog37
Mentor
As @FactChecker said, the usual model is ##y=\beta_1 x+\beta_0+\epsilon## where ##\epsilon \sim \mathcal{N}(0,\sigma)## but this is completely equivalent to the model ##y\sim \mathcal{N}(\beta_1 x +\beta_0,\sigma)##. So if you prefer to think in terms of random variables then you certainly can. In fact, that equivalent model is often used in Bayesian statistics.
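
A minimal sketch of that equivalence, assuming NumPy and made-up parameter values (`beta0`, `beta1`, `sigma` and `x` are illustrative, not from the thread). It draws ##y## both ways at one fixed ##x## and compares the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 0.5   # hypothetical parameter values
x = 3.0                               # one fixed predictor value
n = 200_000

# Form 1: deterministic line plus zero-mean Gaussian noise
y_additive = beta1 * x + beta0 + rng.normal(0.0, sigma, size=n)

# Form 2: draw y directly from a Gaussian centred on the line
y_direct = rng.normal(beta1 * x + beta0, sigma, size=n)

print(y_additive.mean(), y_direct.mean())   # both close to 7.0
print(y_additive.std(), y_direct.std())     # both close to 0.5
```

Here ##\sigma## is treated as a standard deviation, which is how `rng.normal` interprets its second argument.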

• fog37 and FactChecker
Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution? In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random?
To repeat what others have said, the assumptions behind the linear regression model and associated OLS procedure say that the X values have no random errors, so the (X,Y) data is not from the realizations of a bivariate normal random variable.

• fog37
fog37
As @FactChecker said, the usual model is ##y=\beta_1 x+\beta_0+\epsilon## where ##\epsilon \sim \mathcal{N}(0,\sigma)## but this is completely equivalent to the model ##y\sim \mathcal{N}(\beta_1 x +\beta_0,\sigma)##. So if you prefer to think in terms of random variables then you certainly can. In fact, that equivalent model is often used in Bayesian statistics.
Hello Dale,
The sample data, i.e. all the available pairs ##(x,y)##, are modelled as follows:
##y=\beta_1 x+\beta_0+\epsilon##

##Y## is a random variable and its expectation value is ##E[Y|X] = \beta_1 x+ \beta_0##.

The regression model that we compute generates estimates of ##\beta_1## and ##\beta_0## which are ## \hat{\beta_1}## and ## \hat{\beta_0}##.

The regression model itself is ##\hat{\beta_1} x+\hat{\beta_0}##.

Does that mean that the regression model estimates the mean of ##Y## and not ##Y## itself?

We use the regression model ##y_{pred}= \hat{\beta_1} x+\hat{\beta_0}## for predictions of the ##y## values though...

Homework Helper
Gold Member
Does that mean that the regression model estimates the mean of ##Y## and not ##Y## itself?

We use the regression model ##y_{pred}= \hat{\beta_1} x+\hat{\beta_0}## for predictions of the ##y## values though...
It means that the regression model estimates the mean of ##Y## given that ##X=x##.

Mentor
Hmm, I don’t know. Does the regression model not include the error term too? I actually don’t know the right terminology here.

• fog37
fog37
Hmm, I don’t know. Does the regression model not include the error term too? I actually don’t know the right terminology here.
Yes, I am not sure. Books present the linear model as a tool for estimating the value of ##y##, not the mean of ##y##.

Homework Helper
Gold Member
It depends on which results of the regression you use.
If you are just looking for a simple curve fit, ##\hat{y} = \beta_0 + \beta_1 x##, through a more complicated, non-random relationship, ##y = f(x)##, that minimizes the sum-squared errors, then you are estimating the value of ##y(x)## as a deterministic function of ##x##. In that case, any regression results regarding probabilities or statistical significance are not meaningful.

On the other hand, if you are assuming the model ##Y = \beta_0 + \beta_1 x + \epsilon## where ##\epsilon \sim N(0,\sigma)##, then you are assuming that there is a random component of ##Y##. In that case, your regression result is estimating the mean of ##Y##, given the ##x## value, and any regression results regarding probabilities or statistical significance are meaningful.
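
A rough simulation of this second case, assuming NumPy and hypothetical "true" parameters chosen only for illustration. The fitted value at a point `x0` lands near the conditional mean of ##Y##, while any single new ##Y## at `x0` still scatters around it:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 1.0      # hypothetical "true" parameters

# One observed sample with fixed x values
x = np.linspace(0.0, 10.0, 200)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)

# OLS fit; np.polyfit returns the highest-degree coefficient first
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

x0 = 4.0
y_hat = b0_hat + b1_hat * x0             # fitted value at x0

# Many fresh realizations of Y at the same x0
y_new = beta0 + beta1 * x0 + rng.normal(0.0, sigma, size=100_000)

print(y_hat)                          # close to the conditional mean 9.0
print(y_new.mean())                   # also close to 9.0
print(np.abs(y_new - y_hat).mean())   # average miss for a single Y, well above zero
```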

• fog37 and Dale
Staff Emeritus
Gold Member
Yes, I am not sure. Books present the linear model as a tool for estimating the value of ##y##, not the mean of ##y##.

The mean of y is a good estimate for the value of y if you have no other information...

Mentor
Yes, I am not sure. Books present the linear model as a tool for estimating the value of ##y##, not the mean of ##y##.
I know that Wikipedia isn't authoritative, but Wikipedia says:

"A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data "
https://en.wikipedia.org/wiki/Statistical_model

So it seems like Wikipedia includes the random component as part of the statistical model since the random component is part of what is used to generate the sample data.

Please do not take either my comments or Wikipedia's as authoritative. If I could, I would post this under a non-mentor pseudonym to avoid exaggerating my credibility on this. But I am leaning towards considering the whole thing to be the statistical model, not just the best fit function.

• FactChecker
Homework Helper
Gold Member
I struggled with how to distinguish between the statistical mean, ##\hat{Y}##, and the deterministic ##y(x)##. Of course, for a repeatable experiment, given a fixed variable, x, the question would be whether the result, y, always gives the same value or varies randomly. But what about something that is not repeatable, such as values (e.g. daily temperature highs) versus calendar dates? For that, the x values cannot be repeated. Conversely, what about something that we would consider deterministic, but the details are so complicated that we might consider them random? So I guess the best I can do is to refer to how one decides to model the process and whether a random term, ##\epsilon##, is included in the model.
I don't know if there is a good study or reference on this issue. My thoughts on it seem rather amateurish.

• Dale
Mentor
My thoughts on it seem rather amateurish.
Mine too, but your reasoning sounds good.

• FactChecker
##Y## is a random variable and its expectation value of Y is ##E[Y|X] = \beta_1 x+ \beta_0##.

Better notation would be ##E(Y \mid X = x)##.

The value of Y given a specific value of X is a random variable. But considering that X has various possible values, it isn't precise to say that "Y is a random variable". In the model, there is a set of random variables. For each value of X=x, we have a different random variable ##Y_x##.
The regression model that we compute generates estimates of ##\beta_1## and ##\beta_0## which are ## \hat{\beta_1}## and ## \hat{\beta_0}##.

The regression model itself is ##\hat{\beta_1} x+\hat{\beta_0}##.

Does that mean that the regression model estimates the mean of ##Y## and not ##Y## itself?
Since Y|X=x is a random variable, how can we interpret the concept of "estimating Y itself"? Do you mean generating a set of data that follows the distribution of Y?

We use the regression model ##y_{pred}= \hat{\beta_1} x+\hat{\beta_0}## for predictions of the ##y## values though...

The term "estimator" in mathematical statistics can refer to any function of the data. (Thus an "estimator" is itself a random variable when the data come from random variables.) The term "estimate" can refer to one specific value of an estimator that results from one specific set of data. Whether a particular function estimates a particular parameter of a model is a subjective question - it has to do with the intentions of the person using the model. Furthermore, how well an estimator estimates a parameter is also a subjective question because there are various ways to quantify the utility or dis-utility of estimates.

In linear regression, the dis-utility of an estimated value of Y|X=x is measured by the square of the difference between the estimated value and an observed value. This is a subjective choice. For example, another measure would be the absolute value of that difference. Yet another measure might be the percentage difference.
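
A small sketch of how the choice of dis-utility changes the "best" estimate, assuming NumPy and a made-up skewed sample standing in for observations of ##Y## at one fixed ##x##. Scanning a grid of candidate estimates, the squared difference is minimized near the sample mean and the absolute difference near the sample median:

```python
import numpy as np

rng = np.random.default_rng(2)
# Skewed observations of Y at one fixed x (hypothetical data)
y = rng.exponential(scale=2.0, size=5_000)

candidates = np.linspace(0.0, 6.0, 601)       # candidate estimates, step 0.01
diffs = y[:, None] - candidates[None, :]      # every observation vs every candidate

sq_loss = (diffs ** 2).mean(axis=0)           # average squared difference
abs_loss = np.abs(diffs).mean(axis=0)         # average absolute difference

best_sq = candidates[sq_loss.argmin()]
best_abs = candidates[abs_loss.argmin()]

print(best_sq, y.mean())        # squared loss is minimized near the sample mean
print(best_abs, np.median(y))   # absolute loss is minimized near the sample median
```

For a skewed sample like this one the two answers differ noticeably, so the loss function is part of what "estimating Y" means.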

Since the linear regression model involves a set of random variables, we can't say what the "best" estimated values of its parameters are until we say how to condense the measures of dis-utility for each of the different random variables into a single number. If you look at how linear regression does it, it treats all the ##Y_x## variables as having equal importance and estimates their average dis-utility, effectively giving each possible value of X an equal importance. In a practical situation where some values of X can be less frequent or less important, that might not be the "best" way of doing things.

• jim mcnamara
Homework Helper
"Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution?"
Not classically, no. First of all, the assumption of a Gaussian distribution is not part of those required for regression, and when it's made it doesn't apply to the response but to the error distribution.
If you assume both response and predictor are random, the regression model is typically viewed as specifying the conditional expected value of Y given X = x.

"In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random?"
As noted above, traditionally only Y is considered random.

"A random variable has its possible values and associated probabilities. Two random variables ##X## and ##Y## are said to be jointly normal if ##aX+bY## has a normal distribution."
You're missing a bit here: you need to add the statement "for all real numbers, a, b".
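
A sketch of why the "for all ##a, b##" part matters, assuming NumPy. This is a standard textbook-style counterexample rather than anything from the thread: ##X## is standard normal and ##Y = SX## with an independent fair random sign ##S##. Each of ##X## and ##Y## is normal on its own, but ##X+Y## equals exactly zero about half the time, which no non-degenerate normal distribution can do:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)                    # X ~ N(0, 1)
sign = rng.choice([-1.0, 1.0], size=n)    # independent fair random sign
y = sign * x                              # Y ~ N(0, 1) as well, by symmetry

s = x + y                                 # the combination with a = b = 1
frac_zero = np.mean(s == 0.0)             # share of draws where X + Y is exactly 0

print(y.mean(), y.std())   # close to 0 and 1
print(frac_zero)           # about one half
```

So ##aX+bY## is normal for ##(a,b)=(1,0)## and ##(0,1)## but not for ##(1,1)##, and the pair is not jointly normal.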

Homework Helper
Gold Member
To repeat what others have said, the assumptions behind the linear regression model and associated OLS procedure say that the X values have no random errors, so the (X,Y) data is not from the realizations of a bivariate normal random variable.
I think we need to distinguish between a random, but perfectly accurate ##X## value versus a random error in our estimated value of ##X##. The first case is within the scope of traditional linear regression. The second case is different. I do not know enough about that case to discuss it.

Gold Member
Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution? In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random? A random variable has its possible values and associated probabilities. Two random variables ##X## and ##Y## are said to be jointly normal if ##aX+bY## has a normal distribution.

It may be either. If you want to, e.g., measure weight at X=1,2,... years of age, then X is not random.

Homework Helper
Gold Member
Suppose you are using linear regression to fit the data to the model ##y=\beta_1 x + \beta_0 + \epsilon##.
If the ##x## values are known with no errors, then we know that linear regression works fine.

On the other hand, suppose that the measured ##x## values have some errors and that ##X_{measured} = \alpha_1 X_{actual} + \alpha_0 + \epsilon_X##. Then linear regression would give a result like
##Y=\beta_1 X_{measured} + \beta_0 + \epsilon##
## = \beta_1(\alpha_1 X_{actual} +\alpha_0 + \epsilon_X)+ \beta_0+ \epsilon##
## = (\beta_1\alpha_1) X_{actual} + (\beta_1\alpha_0+\beta_0) + (\beta_1\epsilon_X + \epsilon)##.
So it is still a valid process, but it is estimating ##Y## based on the measured ##X## value. That may be what you really want. But if you are trying to get the theoretical relationship between ##Y## and ##X_{actual}##, it might not be a good model to use.
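
A quick simulation of the measured-versus-actual distinction, assuming NumPy and an illustrative setup that is not from the thread: ##Y## is generated from ##X_{actual}##, and the measurement uses ##\alpha_1 = 1##, ##\alpha_0 = 0## with unit-variance noise ##\epsilon_X##. Under these assumptions the slope fitted against ##X_{measured}## is pulled toward zero relative to the slope against ##X_{actual}##:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
beta0, beta1 = 1.0, 2.0                 # hypothetical relationship with X_actual

x_actual = rng.normal(0.0, 1.0, size=n)
y = beta0 + beta1 * x_actual + rng.normal(0.0, 0.5, size=n)

# Measurement error with alpha_1 = 1, alpha_0 = 0 and unit-variance noise
x_measured = x_actual + rng.normal(0.0, 1.0, size=n)

# np.polyfit returns (slope, intercept) for deg=1
slope_actual, _ = np.polyfit(x_actual, y, deg=1)
slope_measured, _ = np.polyfit(x_measured, y, deg=1)

print(slope_actual)     # close to beta1 = 2.0
print(slope_measured)   # close to beta1 * 1 / (1 + 1) = 1.0
```

The second fit is still the best linear predictor of ##Y## from the measured value, but it does not recover the slope of the relationship with ##X_{actual}##.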

Homework Helper
Hmm, I don’t know. Does the regression model not include the error term too? I actually don’t know the right terminology here.
The theoretical model, when written out, states the error term.
$$Y = \beta_0 + \beta_1 \, x + \varepsilon$$

Once you have collected data and estimated the intercept and slope, all the quantities are known entities: no error term is involved.

$$\hat{y} = \widehat{\beta_0} + \widehat{\beta_1} \, x$$

• fog37
Homework Helper
Going a little further: if you assume both Y and X are random it’s typical to assume the underlying model looks like this.
Assume that there are distributions F, M with densities f, m [I'm assuming they have densities: that isn't strictly required, but it makes the exposition a little easier. Notice nothing is said here about either being a Gaussian distribution] such that the joint density of (X, Y) is
$$h(x,y) = f\left(y - \left(\beta_0 + \beta_1 x\right)\right)m\left(x\right)$$
so that the conditional distribution of Y given X = x depends on x. Usually it is assumed that both F and M have finite second moments with F symmetric about zero so that the conditional expectation of Y given X = x is
$$E\left(Y \mid X=x\right) = \beta_0 + \beta_1 \, x$$

In this sense all of the interpretations drawn from the regression equation are conditional.
This model easily generalizes to the multivariate case as well: M is assumed to be a multivariate distribution with positive definite covariance matrix, F has the same assumptions as above, and WHOOSH you have a joint distribution that’s the same form as above.
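
A sketch of this setup with neither distribution Gaussian, assuming NumPy. The specific choices are illustrative only: an exponential distribution plays the role of M and a uniform distribution on ##[-1, 1]## plays the role of F (symmetric about zero, finite second moment):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
beta0, beta1 = 1.0, 2.0                       # hypothetical parameters

x = rng.exponential(scale=1.0, size=n)        # M: skewed, non-Gaussian predictor
eps = rng.uniform(-1.0, 1.0, size=n)          # F: symmetric about zero, non-Gaussian
y = beta0 + beta1 * x + eps

# OLS fit; np.polyfit returns (slope, intercept) for deg=1
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print(b0_hat, b1_hat)                         # close to 1.0 and 2.0

# Empirical conditional mean of Y for x in a narrow window around x0
x0 = 1.5
window = np.abs(x - x0) < 0.05
print(y[window].mean(), beta0 + beta1 * x0)   # both close to 4.0
```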

Gold Member
The theoretical model, when written out, states the error term.

$$Y = \beta_0 + \beta_1 \, x + \varepsilon$$

Once you have collected data and estimated the intercept and slope, all the quantities are known entities: no error term is involved.

$$\hat{y} = \widehat{\beta_0} + \widehat{\beta_1} \, x$$

So ##\hat{Y}## is an unbiased estimator?

Homework Helper
If the predicted (fitted) y-values are ever referred to as unbiased estimators, I've never encountered it. We say the estimated coefficients are unbiased estimators of the population coefficients [the parameters of the model] since the estimates have sampling distributions and the expected values of those distributions equal the corresponding parameters. There isn't a parameter that corresponds to yhat, so I'm not sure what quantity you would consider it to be unbiased for.
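
A small Monte Carlo sketch of what "the estimates have sampling distributions" means, assuming NumPy, a fixed design, and hypothetical true parameters. Refitting on many simulated samples, the averages of the estimated intercept and slope sit close to the parameters that generated the data:

```python
import numpy as np

rng = np.random.default_rng(6)
beta0, beta1, sigma = 1.0, 2.0, 1.0      # hypothetical true parameters
x = np.linspace(0.0, 5.0, 30)            # fixed design, reused in every replication

estimates = np.empty((5_000, 2))
for i in range(estimates.shape[0]):
    # New random errors each time, same x values
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1_hat, b0_hat = np.polyfit(x, y, deg=1)
    estimates[i] = (b0_hat, b1_hat)

print(estimates.mean(axis=0))   # close to (1.0, 2.0)
print(estimates.std(axis=0))    # spread of the two sampling distributions
```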

• fog37