Linear regression with error term

In summary: with a stochastic regressor ##X##, taking expectations of ##Y = a + bX + U## yields a relation between the means of two fixed distributions, ##E[Y] = a + bE[X]##. With a deterministic regressor, each chosen value of ##x## gives ##Y## a different distribution, so the model must be read as ##Y_i = a + bx_i + U_i##, one response distribution per design point. The thread discusses whether regression with stochastic regressors should be viewed as a relation between fixed distributions or as a stochastic process.
  • #1
Kyouran
TL;DR Summary
Linear regression
I'm not a statistician, but this has been bothering me for a bit. Suppose we have the simple model

##Y = a + bX + U##

where ##Y##, ##X##, and ##U## are taken to be random variables representing the dependent variable, the explanatory (independent) variable, and the error term respectively.

In the case of a stochastic regressor ##X##, we can write the expected value of the dependent variable as ##E[Y] = a + bE[X]## (assuming the expected value of the error term ##U## is zero). Since ##E[X]## is a constant parameter of the distribution, ##E[Y]## is also a constant, and this works: taking the expected value simply yields a relationship between the means of two distributions. So we can take ##X## and ##Y## each to have a fixed distribution, and each observation then constitutes a random draw from these two fixed distributions. However, in time series analysis we sometimes encounter a similar-looking model, e.g. ##Y_t = a + bX_t + U_t##, where the same situation is viewed as a stochastic process rather than as a relation between two random variables.
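A quick numpy sketch of this "fixed distributions" view (the numbers here are made up, purely to illustrate): each observation is one draw of the pair ##(X, Y)##, and the sample means then obey ##E[Y] = a + bE[X]##.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 0.5                       # illustrative intercept and slope
n = 100_000

X = rng.normal(10.0, 3.0, size=n)     # stochastic regressor: draws from one fixed distribution
U = rng.normal(0.0, 1.0, size=n)      # error term with E[U] = 0
Y = a + b * X + U                     # each observation is one draw of the pair (X, Y)

# The relation between the two distribution means: E[Y] = a + b*E[X]
print(Y.mean(), a + b * X.mean())     # both approximately 7.0
```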

With a deterministic regressor, however, trying the same gives ##E[Y] = a + bx##, and unless ##x## is fixed this can't be correct, since it would imply that ##Y## has a changing mean and thus cannot be a single random variable with one fixed distribution. So in the case of a deterministic regressor we need at least ##n## distributions ##Y_1, \dots, Y_n##, one for each value of ##x## chosen, i.e. a model ##Y_i = a + bx_i + U_i##.
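The same kind of sketch for the fixed-design case (again with made-up numbers): each chosen ##x_i## carries its own response distribution ##Y_i##, with its own mean ##a + bx_i##.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 0.5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # deterministic design points chosen by the experimenter

# Each x_i has its own response distribution Y_i = a + b*x_i + U_i;
# draw many replicates of each Y_i to see the separate means.
reps = 10_000
Y = a + b * x + rng.normal(0.0, 1.0, size=(reps, x.size))
print(Y.mean(axis=0))                     # ~ [2.5, 3.0, 3.5, 4.0, 4.5], one mean per x_i
```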

So, my questions here are regarding whether I'm viewing this correctly or not:

1) Should I view linear regression with stochastic regressors as a stochastic process, where each observation ##i## corresponds to a realization of a different random variable ##Y_i## in this process, or should I view it (as I argued above) as a simple relation between ##k + 1## distributions (##k## being the number of regressors), where each observation is just a different realization of the same set of random variables? In other words, ##Y = a + bX + U## vs. ##Y_i = a + bX_i + U_i##? If both views are possible, are there any implications of choosing one over the other?

2) Is my analysis of deterministic regressors here correct?
 
  • #2
##E[Y] = a + bx## works with deterministic ##x##; are you equating deterministic with stationary? The latter can be obtained by differencing. For example, if you model Apple's stock price (##y##) relative to the S&P 500 price (##x##) by ##y = a + bx + e##, this does not have a stable mean, but if you do the same in first differences (the % change in price, forgetting dividends for the moment) then you do have a stable mean.
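A small simulation sketch of the point (simulated prices, not real market data; the coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 252
# Simulated daily % changes: the stock's return loads on the index's return
x_ret = rng.normal(0.0003, 0.01, size=n)                      # index returns
y_ret = 0.0001 + 1.2 * x_ret + rng.normal(0.0, 0.01, size=n)  # stock returns

# Price levels (cumulative products) drift and have no stable mean,
# but the first differences (the % changes) do.
x_price = 100 * np.cumprod(1 + x_ret)
y_price = 50 * np.cumprod(1 + y_ret)

b, a = np.polyfit(x_ret, y_ret, 1)    # regression in first differences
print(a, b)                           # roughly 0.0001 and 1.2
```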
 
  • #3
BWV said:
##E[Y] = a + bx## works with deterministic ##x##; are you equating deterministic with stationary? The latter can be obtained by differencing. For example, if you model Apple's stock price (##y##) relative to the S&P 500 price (##x##) by ##y = a + bx + e##, this does not have a stable mean, but if you do the same in first differences (the % change in price, forgetting dividends for the moment) then you do have a stable mean.
I'm aware of differencing; I'm more interested in the subtle theoretical difference between treating the explanatory variable as a single random variable which takes on many different values and treating it as part of a stochastic process. AFAIK, in the case of differencing, one could say you are trying to create a single random variable with a stable mean from a stochastic process. So it's more of a deeply theoretical question.
 
  • #4
@Kyouran I think that being able to do multivariate linear regression analysis with consideration of stochastic processes is likely to help you land a job in a decision support group at a major financial institution ##\dots##
 
  • #5
Kyouran said:
I'm aware of differencing; I'm more interested in the subtle theoretical difference between treating the explanatory variable as a single random variable which takes on many different values and treating it as part of a stochastic process. AFAIK, in the case of differencing, one could say you are trying to create a single random variable with a stable mean from a stochastic process. So it's more of a deeply theoretical question.
In my stock example both ##Y## and ##X## can be represented by the stochastic process of geometric Brownian motion, but other processes may not work. It would need to be a martingale and have normally distributed error terms.
 
  • #6
Kyouran said:
##Y = a + bX + U##

where ##Y##, ##X##, and ##U## are taken to be random variables representing the dependent variable, the explanatory (independent) variable, and the error term respectively.

In the usual sort of linear regression (linear least squares) ##X## is not considered a random variable. For example, in an experiment, the experimenter may systematically set ##X## to a sequence of (say) voltages ##X = 1.0, 2.0, 3.0, 4.0, 5.0## and measure the value of ##Y## at those voltages.

Of course, one may speak of the mean of any set of numbers. And one may imagine "picking a number at random" from any set of numbers. However, imagining doing it may have nothing to do with the mathematical model that involves the set of numbers.

Kyouran said:
So we can take ##X## and ##Y## each to have a fixed distribution, and each observation then constitutes a random draw from these two fixed distributions.

It's unlikely that you can model an experiment where ##X## is set systematically that way. For example, if you pick 20 random voltages from the set ##\{1.0,2.0,3.0,4.0,5.0\}## you are unlikely to reproduce an experiment where the experimenter decided to use each voltage exactly 4 times.
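That claim is easy to check by counting: the chance that 20 uniform random picks from the five voltages land exactly four on each is tiny.

```python
from math import comb

# Probability that 20 uniform picks from {1.0, ..., 5.0} use each
# voltage exactly 4 times: multinomial count over 5**20 outcomes
ways = comb(20, 4) * comb(16, 4) * comb(12, 4) * comb(8, 4) * comb(4, 4)
print(ways / 5**20)   # about 0.0032
```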

There are types of regression where both ##X## and ##Y## are modeled as random variables, in the sense that each are measurements that include random errors. (e.g. "total least squares regression")

Kyouran said:
However, in time series analysis we sometimes encounter a similar-looking model, e.g. ##Y_t = a + bX_t + U_t##, where the same situation is viewed as a stochastic process rather than as a relation between two random variables.

What does "X_t" represent? There are types of time series where ##X_t## is a random variable that is a function of values ##X_{t-1}, X_{t-2},...## that happened previously, plus some random disturbance that happens at time ##t##.

A stochastic process assigns probabilities to "trajectories". For example, a stochastic process model of a stock's price over 100 days assigns a probability to each possible graph of that stock's price over the 100 days.

You can imagine picking a day at random and then picking (from all the possible graphs) a price on that day at random. However, for a reasonably sophisticated stochastic process, there need not be any simple relation between the mean day selected, the mean stock value selected, and the constants involved in the model. Perhaps you are thinking of a very simple model such as ##Y(t) = A + tB + U(t)##, where ##A, B## are constants and ##U(t)## is a random variable.
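For that simple model, a sketch of what "assigning probabilities to trajectories" looks like (made-up constants; each sampled row is one possible 100-day graph):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B = 100.0, 0.5
t = np.arange(100)

# Each row is one sampled trajectory of Y(t) = A + t*B + U(t)
trajectories = A + B * t + rng.normal(0.0, 5.0, size=(10, t.size))
print(trajectories.shape)            # (10, 100): 10 sampled 100-day graphs
print(trajectories[:, 50].mean())    # values at day 50 scatter around A + 50*B = 125
```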
 
  • #7
Stephen Tashi said:
In the usual sort of linear regression (linear least squares) ##X## is not considered a random variable.
In classical linear regression this is indeed the case, but it is easy to think of examples from e.g. the social sciences where ##X## is random yet measured without error (and hence no total least squares is required). Take for example the age of a randomly selected individual from a population, the number of years they have studied, their wage, etc. You are still randomly selecting the individual, so the regressor here is stochastic.

Stephen Tashi said:
What does "X_t" represent? There are types of time series where ##X_t## is a random variable that is a function of values ##X_{t-1}, X_{t-2},...## that happened previously, plus some random disturbance that happens at time ##t##.
What I meant was regressing one time series onto another, i.e. ##Y_t## onto ##X_t##. Mathematically the formula looks pretty much the same as that for a regression between ##X## and ##Y##, but at a deeper level they seem to be completely different things.

Let's say that you have a population and select from each individual ##i## their wage ##Y_i## and their age ##X_i##. If you select ##n## individuals, then you could say you have ##2n## different random variables ##Y_1, \dots, Y_n## and ##X_1, \dots, X_n##. However, it is clear that ##Y_1 = Y_2 = \dots = Y_n = Y##, as they are from the same population, and the same goes for the ##X_i##: they can be seen as realizations of the same ##X##. Because of this, we end up with ##n## equations to solve for the parameters ##a## and ##b##. Also note that ##i## was just a dummy index; I could have used ##t## instead, and mathematically it would have looked exactly like the formula I gave for the time series example.
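A sketch of this first case with made-up numbers: ##n## draws from one joint (age, wage) distribution give ##n## equations, from which ##a## and ##b## can be estimated.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 10.0, 0.8                       # hypothetical intercept and slope
n = 500

age = rng.uniform(20, 65, size=n)      # randomly selected individuals -> stochastic regressor
wage = a + b * age + rng.normal(0.0, 5.0, size=n)

# n realizations of the same pair (X, Y), fitted by ordinary least squares
b_hat, a_hat = np.polyfit(age, wage, 1)
print(a_hat, b_hat)                    # close to 10.0 and 0.8
```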

Now let's see how this works with two time series. You select at each time the price of stock 1, ##X_t##, and stock 2, ##Y_t##. If you select ##n## times, you again have ##Y_1, \dots, Y_n## and ##X_1, \dots, X_n## just as in the previous paragraph, but clearly this time we cannot just assume ##Y_1 = Y_2 = \dots = Y_n##, nor that ##X_1 = X_2 = \dots = X_n##.

In the first case, we can estimate a relation between the expected values ##E[Y]## and ##E[X]## because we have a total of ##n## data points for two random variables, ##X## and ##Y##. In addition, we find that for the estimated parameters to be unbiased, we simply need ##E[U|X] = 0##. In the second case, ##Y_1## is not necessarily the same as ##Y_2##, and we end up with only one data point for each pair of random variables ##(X_t, Y_t)##. All we can say is that for some specific value of ##t##, ##E[Y_t] = a_t + b_t E[X_t]##; not only are the true ##a_t## and ##b_t## parameters different for each pair ##(X_t, Y_t)##, but we also have only one data point for each such pair. This seems to defeat the whole purpose behind linear regression. So it seems to me we have to start making assumptions about the underlying process that generates these random variables.
Stephen Tashi said:
A stochastic process assigns probabilities to "trajectories". For example, a stochastic process model of a stocks price over 100 days assigns a probability to each possible graph of that stocks price over the 100 days.

You can imagine picking a day at random and then picking (from all the possible graphs) a price on that day at random. However, for a reasonably sophisticated stochastic process, there need not be any simple relation between the mean day selected, the mean stock value selected, and the constants involve in the model. Perhaps you are thinking of a very simple model such as
##Y(t) = A + tB + U(t)## where ##A,B## are constants and ##U(t)## is a random variable.
For the very simple model here, ##t## would be a deterministic regressor (i.e. you choose the times). But in this case you have the problem that there is a different distribution ##Y_t## for each value of ##t##. This implies that you only have a single data point to estimate anything about the distribution ##Y_t##: with ##m## times you have ##m## data points for ##m## distributions. So you have to start making assumptions about the relation between ##Y_t## and ##Y_{t-1}##, or you will get nowhere. In fact, assuming a fixed ##A## and ##B## immediately leads to the following relation between ##Y_t## and ##Y_{t-1}## (apply differencing):

##Y_t = Y_{t-1} + B + U_t - U_{t-1}##

But then you'd have to start justifying that assumption, and it is not as trivial as in the case with the population ages and wages.
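The differencing step itself is easy to verify numerically (made-up constants): under a fixed ##A## and ##B##, the first difference of the trend model has the constant mean ##B##.

```python
import numpy as np

rng = np.random.default_rng(5)
A, B = 100.0, 0.5
t = np.arange(200)
U = rng.normal(0.0, 1.0, size=t.size)
Y = A + B * t + U                # trend model: a different distribution Y_t at each t

dY = np.diff(Y)                  # Y_t - Y_{t-1} = B + U_t - U_{t-1}
print(dY.mean())                 # ~ 0.5 = B: the changing mean becomes a constant one
```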
 
  • #8
Kyouran said:
Let's say that you have a population and select from each individual ##i## their wage ##Y_i## and their age ##X_i##. If you select ##n## individuals, then you could say you have ##2n## different random variables ##Y_1, \dots, Y_n## and ##X_1, \dots, X_n##. However, it is clear that ##Y_1 = Y_2 = \dots = Y_n = Y##, as they are from the same population,

If you are assuming a linear regression model of the form ##Y = AX + B + U##, then the random variable ##Y## measured when ##X = 1## does not have the same distribution as the random variable ##Y## measured when ##X = 2##.

In what you call stochastic regression, you have ##n## data points selected from the joint distribution of the random vector ##(X,Y)##. If you consider all ##n## data points to be data for a single "random variable", then "random variable" must refer to a random vector.

From the joint distribution, we can compute the marginal distribution of ##Y## and we can compute conditional distributions for things like ##(Y|x=2)##.

You can consider all the ##Y## measurements to be from the same population, if you consider the marginal distribution of ##Y## to be the distribution of that population. However the marginal distribution of ##Y## is not specified in a linear regression model.

The linear regression model ##Y_i = AX_i + B + U_i## does not specify the joint distribution of ##(X,Y)##. Hence it does not specify the marginal distribution of ##Y##. It does specify conditional distributions, like the distribution of ##Y## given that ##X = 2##.
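A sketch of that distinction (the distribution chosen for ##X## below is an arbitrary assumption, exactly the part the regression model leaves unspecified):

```python
import numpy as np

rng = np.random.default_rng(6)
A, B = 0.5, 2.0
n = 200_000

# A marginal for X must be assumed; the regression model does not supply it
X = rng.choice([1.0, 2.0, 3.0], size=n, p=[0.6, 0.3, 0.1])
Y = A * X + B + rng.normal(0.0, 1.0, size=n)

print(Y[X == 2.0].mean())   # conditional mean (Y | X = 2): ~ A*2 + B = 3.0, fixed by the model
print(Y.mean())             # marginal mean: ~ A*E[X] + B = 2.75 here, but it shifts with X's law
```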

Kyouran said:
In the second case, ##Y_1## is not necessarily the same as ##Y_2##, and we end up with only one data point for each pair of random variables ##(X_t, Y_t)##

Kyouran said:
So it seems to me we have to start making assumptions about the underlying process that generates these random variables.

Yes, sufficient data to analyze a stochastic process requires either data from multiple trajectories of the process (not the situation of one ##(X,Y)## pair per time ##t## that you described) or some assumption that the formula for generating a trajectory can be estimated from segments of a single trajectory (e.g. stationary random processes, Box-Jenkins models, etc.).
 
  • #9
@Stephen Tashi Yes, in the example where one draws data from a population, you can consider having joint, marginal, and conditional distributions (in that specific case one is mostly interested in ##E[Y|x]##, not so much ##E[Y]##, even though the latter is well-defined there), whereas in the time series example the underlying structure is different (e.g. a stochastic process), yet linear regression can be applied to both.

I suppose sometimes it may be just a point of view as to how everything is considered, but this discussion did help my understanding of it all. Thanks!
 

What is linear regression with error term?

Linear regression with error term is a statistical method used to analyze the relationship between two or more variables. It assumes that the relationship between the variables can be represented by a straight line. The error term, also known as the residual, represents the difference between the actual values and the predicted values by the regression line.

What is the purpose of the error term in linear regression?

The error term in linear regression serves as a measure of the variability of the data points around the regression line. It helps to identify the accuracy of the regression model and determine if the model is a good fit for the data. A smaller error term indicates a better fit of the model.

How is the error term calculated in linear regression?

The error term is calculated by taking the difference between the actual value of the dependent variable and the predicted value by the regression line. This difference is then squared to remove any negative values and to give more weight to larger errors. The sum of the squared errors is then divided by the number of data points to calculate the mean squared error.
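A minimal sketch of that computation (illustrative numbers):

```python
import numpy as np

y_actual = np.array([2.1, 3.9, 6.2, 7.8])    # observed values of the dependent variable
y_pred = np.array([2.0, 4.0, 6.0, 8.0])      # values predicted by the regression line

residuals = y_actual - y_pred                # the error terms
mse = np.mean(residuals ** 2)                # squared to drop signs, then averaged
print(mse)                                   # 0.025
```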

What is the significance of the error term in linear regression?

The error term is significant in linear regression as it helps to assess the accuracy of the model and determine the level of confidence in the relationship between the variables. It also helps to identify any outliers or influential data points that may affect the overall results of the regression analysis.

How can the error term be minimized in linear regression?

The error term can be minimized in linear regression by finding the best-fit line that minimizes the sum of the squared errors. This can be achieved by using various techniques such as ordinary least squares, maximum likelihood estimation, or gradient descent. Additionally, removing outliers or influential data points can also help to reduce the error term.
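For ordinary least squares with a single regressor, the minimizing line even has a closed form (a sketch with illustrative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 3.6, 4.1, 4.4])

# Closed-form OLS estimates that minimize the sum of squared errors
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope = cov(x, y) / var(x)
a = y.mean() - b * x.mean()                     # intercept
print(a, b)
```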
