# What are the commonly used estimators in regression models?

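In summary: A quadratic model such as ##Y=\beta_0 + \beta_1 X+ \beta_2 X^2## is still linear in its parameters, so OLS applies to it without transforming the data; whether OLS is "best" depends on the assumptions about the errors, not on the shape of the trend.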
fog37
TL;DR Summary
estimators and OLS
Hello everyone,

I am trying to close the loop on this important topic of estimators.

An estimator is really just a function to calculate point statistics that are close estimates (with low variance) of population parameters. For example, given a set of data, we can compute the mean and the median. Both the mean and the median are estimators of the center of the data (the mean is better).
• My understanding is that in the case of linear regression, if the key assumptions are met (assumptions mostly about how residuals behave), we can use OLS to find the "best" regression sample coefficients that approximate the regression population coefficients. OLS, if the conditions are met, represents the best linear unbiased estimator (BLUE). In the case of linear regression, we have OLS estimators, i.e. functions to calculate the sample intercept, the sample slope, and the sample correlation coefficient... We could also solve for those same coefficients using Maximum Likelihood (ML), which is another important estimator, but the estimated coefficients would not be the "best" in that case, correct?
• OLS is BLUE, assuming the assumptions are met, as long as the model is "linear in the parameters" so OLS would give the best estimates also for polynomial regression models like ##Y=\beta_0 + \beta_1 X+ \beta_2 X^2##, correct? But would we need to first "convert" the polynomial data into linear data, before applying OLS, taking the ##\sqrt() ## of the independent variable ##X## so the data in the scatterplot follows some sort of linear trend? If the untransformed data follows a quadratic model, the OLS assumptions about the residuals are not met so the OLS estimators would not perform best... So in what sense is OLS generally applicable to linear models?
• What are other commonly used estimators? Maximum likelihood is the important estimator used to estimate coefficients for generalized linear models...Which other estimator is important to know? I read about weighted least-squares which I guess is a variant of OLS
Thanks!

fog37 said:
TL;DR Summary: estimators and OLS

We could also solve for those same coefficients using Maximum Likelihood (ML), which is another important estimator, but the estimated coefficients would not be the "best" in that case, correct?
The OLS is the best (minimum variance) linear unbiased estimator. It is possible for a non-linear estimator to be both better (less variance) than OLS and unbiased. In some scenarios all linear estimators will be bad, so being the best of a bad group is not always good.

I don’t know ML’s properties in this regard.

Note 1: In this post, it might seem that I am being picky about some of your terminology. That is not to criticize your terminology; I am just trying to be specific about my meaning. But the more precise you can be, the better.
Note 2: In this post, I always assume that the ##\epsilon_i##s are an independent, identically distributed sample from a ##N(0, \sigma)## distribution.
fog37 said:
An estimator is really just a function to calculate point statistics that are close estimates (with low variance)
I think you mean the smallest sum-squared-errors for the given sample. Not the lowest variance of the estimator function.
fog37 said:
of population parameters.
For example, given a set of data, we can compute the mean and the median. Both the mean and the median are estimators of the center of the data (the mean is better).
"better" in what sense? For some uses, the median is better.
fog37 said:
• My understanding is that in the case of linear regression, if the key assumptions are met (assumptions mostly about how residuals behave), we can use OLS to find the "best" regression sample coefficients that approximate the regression population coefficients.
OLS, based on the sample, is the definition of the regression coefficients. IMO, to say that it gives the "best" approximation of the regression coefficients is misleading. If the linear model is correct, OLS gives the "best" approximation of the true linear coefficients from the given sample. (Best in the sense of ML)
fog37 said:
• OLS, if the conditions are met, represents the best linear unbiased estimator (BLUE). In the case of linear regression, we have OLS estimators, i.e. functions to calculate the sample intercept, the sample slope, and the sample correlation coefficient... We could also solve for those same coefficients using Maximum Likelihood (ML), which is another important estimator, but the estimated coefficients would not be the "best" in that case, correct?
Under the usual assumptions, the OLS gives the ML estimator. Minimizing the SSE maximizes the likelihood function, assuming the linear model is correct.
fog37 said:
• OLS is BLUE, assuming the assumptions are met, as long as the model is "linear in the parameters" so OLS would give the best estimates also for polynomial regression models like ##Y=\beta_0 + \beta_1 X+ \beta_2 X^2##, correct?
Correct.
fog37 said:
• But would we need to first "convert" the polynomial data into linear data, before applying OLS, taking the ##\sqrt() ## of the independent variable ##X## so the data in the scatterplot follows some sort of linear trend?
No. The real question should be whether the linear (in coefficients) model ##Y = a_0 + a_1 X + a_2 X^2 + \epsilon## is the correct model. If it is, then minimizing the SSE is the same as minimizing the sum of the squared calculated values of the ##\epsilon_i##s in that model. Then OLS would give you the ML.
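A minimal sketch of this point, in plain Python: the quadratic model is fit by ordinary least squares directly on the columns ##1, X, X^2##, with no transformation of ##X##. The helper names (`solve_linear_system`, `fit_polynomial_ols`) and the noise-free toy data are assumptions made for illustration; they do not come from the posts in this thread.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so the original inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest pivot into place for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_polynomial_ols(xs, ys, degree):
    """Return OLS coefficients [a0, a1, ..., a_degree] for y = sum_j a_j x^j."""
    # Design matrix: one column per power of x. The model is linear in the a_j.
    design = [[x ** j for j in range(degree + 1)] for x in xs]
    p = degree + 1
    # Normal equations (X^T X) a = X^T y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from Y = 1 + 2 X + 3 X^2 (no noise, for clarity).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
coefficients = fit_polynomial_ols(xs, ys, degree=2)
print(coefficients)
```

The only "transformation" involved is adding ##X^2## as an extra column; the fitting step is the same least-squares calculation used for a straight line.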
fog37 said:
• If the untransformed data follows a quadratic model, the OLS assumptions about the residuals are not met so the OLS estimators would not perform best..
I disagree. I assume that you mean "best" in the sense of ML.
Remember that, given a correct model ##Y = f(X) + \epsilon##, where ##f(X)## is deterministic, maximizing the likelihood is the same as minimizing the sum of the squared ##\epsilon_i##s (under the normality assumption of Note 2). So regardless of the form of ##f(X)##, the OLS would give the ML.
fog37 said:
• . So in what sense is OLS generally applicable to linear models?
OLS minimizes the sum of squares of the ##\epsilon_i## in the model ##Y = f(x)+\epsilon##, where ##\epsilon## is the only random component of the true system. In that case, OLS is the ML estimator, regardless of the form of ##f(x)##.
fog37 said:
• What are other commonly used estimators? Maximum likelihood is the important estimator used to estimate coefficients for generalized linear models...Which other estimator is important to know? I read about weighted least-squares which I guess is a variant of OLS
It all depends on the particular application. There are a lot of diverse situations with considerations other than ML or OLS. Suppose small errors are tolerable, but an error larger than a certain limit is a disaster. Then you might want an estimator that tolerates many small errors as long as none of them exceeds the limit.
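On the weighted least squares mentioned in the question, a brief sketch in standard notation (the weights ##w_i##, the design matrix ##X## with rows ##x_i^T##, and the matrix ##W## are symbols introduced here, not taken from the posts). WLS minimizes a weighted sum of squared residuals and reduces to OLS when all weights are equal:

$$\hat{\beta}_{WLS} = \arg\min_{\beta} \sum_{i=1}^n w_i \left(y_i - x_i^T \beta\right)^2 = \left(X^T W X\right)^{-1} X^T W y, \qquad W = \operatorname{diag}(w_1, \dots, w_n)$$

A common choice, assuming the error variances ##\sigma_i^2## are known up to a constant, is ##w_i \propto 1/\sigma_i^2##, so that noisier observations count less.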

fog37 said:
My understanding is that in the case of linear regression, if the key assumptions are met (assumptions mostly about how residuals behave), we can use OLS to find the "best" regression sample coefficients that approximate the regression population coefficients. OLS, if the conditions are met, represents the best linear unbiased estimator (BLUE). In the case of linear regression, we have OLS estimators, i.e. functions to calculate the sample intercept, the sample slope, and the sample correlation coefficient... We could also solve for those same coefficients using Maximum Likelihood (ML), which is another important estimator, but the estimated coefficients would not be the "best" in that case, correct?
That sounds like the Gauss-Markov theorem.
fog37 said:
• OLS is BLUE, assuming the assumptions are met, as long as the model is "linear in the parameters" so OLS would give the best estimates also for polynomial regression models like ##Y=\beta_0 + \beta_1 X+ \beta_2 X^2##, correct?
How are the errors modeled? If it's ##Y = f(X) + ## error then I think you are correct.

If it's ##Y = f(X + ##error ##)## then there is a problem. The Gauss-Markov Theorem would require assuming that the error associated with ##X## be uncorrelated with the error associated with any power of ##X## occurring in the model. Whether that must happen for ##X^2## is an interesting question. If we consider ##X## and ##X^3##, it's hard to imagine a laboratory situation where the error in measuring ##X## would be uncorrelated with an error in measuring ##X^3##. I suppose it would be a situation where two different instruments are used to measure the two quantities.

fog37
Stephen Tashi said:
That sounds like the Gauss-Markov theorem.

How are the errors modeled? If it's ##Y = f(X) + ## error then I think you are correct.
If it's ##Y = f(X + ##error ##)## then there is a problem. The Gauss-Markov Theorem would require assuming that the error associated with ##X## be uncorrelated with the error associated with any power of ##X## occurring in the model. Whether that must happen for ##X^2## is an interesting question. If we consider ##X## and ##X^3##, it's hard to imagine a laboratory situation where the error in measuring ##X## would be uncorrelated with an error in measuring ##X^3##. I suppose it would be a situation where two different instruments are used to measure the two quantities.
a) I have always seen the data as decomposed into model and random error: DATA = MODEL + ERROR = ##f(X) +\epsilon##. The functional form can be linear w.r.t. the coefficients (linear models) or not (nonlinear models). Logistic regression is a "generalized" linear model...

When would the error be part of the functional form as you describe?

Also, you confirm that if the data ##(x,y)## has a quadratic trend, as seen from a scatterplot, the suitable model would be ##Y=b_0 +b_1 X^2##, which can be used with OLS to find estimators that give good and reliable estimates of ##b_0## and ##b_1##, always assuming that the key OLS conditions hold (errors are independent, zero mean, constant variance, etc.). We don't need to "linearize" the data by transforming the variables so that the transformed data has a linear regression model instead of a quadratic polynomial model... So why do we linearize if we can use the quadratic model + OLS? OLS deals with all linear models like polynomials, logarithms, etc.

b) Assuming we detect violations of the key OLS assumptions, corrections are possible. Do the corrections take the OLS estimates and modify them, based on the extent of the violation, making the estimated coefficients more appropriate? Is that how corrections are applied once the violations are detected?

Thank you!

fog37 said:
When would the error be part of the functional form as you describe?
It would take that form if we assume ##X## is not an exact measurement - which I understand is not the situation you want to consider. However, it is a common laboratory situation.

Suppose that ##Y = f(X)## is an exact physical law. A laboratory experiment measures ##X## with a random error of ## \epsilon##. The physical law tells us that the observed ##Y## is ##f(X+ \epsilon)## (assuming no measurement error in ##Y##).

Suppose ##f(X)## is an invertible function. Then fitting the model ##X + \epsilon = f^{-1}(Y)## is the usual case for regression where the independent variable ##Y## has no error.

fog37 said:
So why do we linearize if we can use the quadratic model+OLS? OLS deals with all linear models like polynomials, logarithms, etc.
You should specify what the situation is when "we" linearize the model. I don't think there is any good answer that fits all situations.
I know of two common situations:
1) When the random errors are a percentage of the ##Y## values. The model is ##Y = e^{\epsilon} f(X)##. Taking logarithms puts it into the additive-error form ##\ln(Y) = \ln(f(X)) + \epsilon##, where ##\epsilon## is ##N(0, \sigma)##. In this case, least squares can still be used.
2) When the coefficients ##a_i## of ##Y = a_0 + a_1 X + \epsilon## are not expected to be constant. The relationship is locally linearized, giving a linear model with "scheduled" coefficients ##a_i(z)## that are functions of some external parameters ##z##. For fixed values of ##z##, the situation is standard.
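A small sketch of situation 1, assuming for illustration that ##f(X) = a e^{bX}## (this particular ##f##, the values of `a_true` and `b_true`, and the helper `fit_simple_ols` are hypothetical choices, not part of the posts). After taking logarithms, the model ##\ln(Y) = \ln(a) + bX + \epsilon## is an ordinary straight-line regression:

```python
import math


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical f(X) = a * exp(b * X); the errors multiply f(X) as exp(eps).
a_true, b_true = 2.0, 0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
# Fixed illustrative errors, chosen to sum to zero and to be uncorrelated
# with xs so that this toy fit recovers a and b exactly.
eps = [0.1, -0.1, 0.0, -0.1, 0.1]
ys = [math.exp(e) * a_true * math.exp(b_true * x) for x, e in zip(xs, eps)]

# Taking logs gives ln(Y) = ln(a) + b * X + eps, linear in (ln a, b).
log_ys = [math.log(y) for y in ys]
log_a_hat, b_hat = fit_simple_ols(xs, log_ys)
a_hat = math.exp(log_a_hat)
print(a_hat, b_hat)
```

With real data the errors would not be this well behaved, so the estimates would only be close to the true values rather than equal to them.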
fog37 said:
b) Assuming we detect violations of the key OLS assumptions, corrections are possible. Do the corrections take the OLS estimates and modify them, based on the extent of the violation, making the estimated coefficients more appropriate? Is that how corrections are applied once the violations are detected?
I am only familiar with the two situations I mentioned above. There may be many other situations that others can comment on.

In regards to the Gauss-Markov assumptions that the error term must have a mean of zero and a Gaussian distribution, these assumptions are supposed to be at each particular value of ##X##.
But when we check to see if these two conditions are met, we use the data in our sample, we calculate the residuals at each different ##X## value, and take the average of all those residuals for ALL the ##X##. We then plot the distribution of all those same residuals to see if it is normal. This is different than looking at the error's behavior "vertically" at each specific ##X##.

What justifies this approach, using the sample data, to check if the ##E[\epsilon|X]=0## condition and ##\epsilon \approx N( 0, \sigma)## condition are true for each ##X##?

fog37 said:
What justifies this approach, using the sample data, to check if the ##E[\epsilon|X]=0## condition and ##\epsilon \approx N( 0, \sigma)## condition are true for each ##X##?

Whether empirical data "justifies" any definite conclusion is a question about human subjective judgements, unless you want to interpret the process in some precise manner - such as a statistical hypothesis test with an arbitrarily selected "p-value". Even that involves a subjective choice.

From a subjective point of view, looking at the residuals at all the X values falls into the pattern of thinking: If my model M is correct then the data has property P. I will see if P is true. If it isn't, then I will conclude my model is incorrect.

That pattern of thinking doesn't justify the conclusion: If the data has property P then my model is correct.

What are ways to treat the residuals at X values individually? I suppose we could partition the X values into bins and use that division to define a division of the corresponding residuals into bins. Then we would need a statistical test (like chi-square) to guess whether all the bins contain samples from the same distribution.
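The binning idea can be sketched as follows. This is only an illustration under assumed toy data (a straight line fitted to points that actually follow ##Y = X^2##), and the function name `mean_residual_by_bin` is hypothetical. It computes the mean residual per bin rather than running a formal test such as chi-square:

```python
def mean_residual_by_bin(xs, residuals, n_bins):
    """Split the range of xs into equal-width bins; return mean residual per bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, r in zip(xs, residuals):
        # The largest x would land one past the end, so clamp to the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        sums[index] += r
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy case: a straight line fitted to data that actually follow Y = X^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 2 for x in xs]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# With an intercept in the model, the overall mean residual is zero by
# construction, so it cannot reveal a problem on its own.
overall_mean = sum(residuals) / n
bin_means = mean_residual_by_bin(xs, residuals, n_bins=3)
print(overall_mean, bin_means)
```

In this toy case the overall mean is zero while the outer bins have positive means and the middle bin a negative one, which is the kind of pattern that looking at residuals "vertically" is meant to expose.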

fog37 said:
In regards to the Gauss-Markov assumptions that the error term must have a mean of zero and a Gaussian distribution, these assumptions are supposed to be at each particular value of ##X##.
But when we check to see if these two conditions are met, we use the data in our sample, we calculate the residuals at each different ##X## value, and take the average of all those residuals for ALL the ##X##. We then plot the distribution of all those same residuals to see if it is normal. This is different than looking at the error's behavior "vertically" at each specific ##X##.

What justifies this approach, using the sample data, to check if the ##E[\epsilon|X]=0## condition and ##\epsilon \approx N( 0, \sigma)## condition are true for each ##X##?
LS regression does not require the errors have a Gaussian distribution, and neither does the GM Theorem.

One other comment: don't conflate LS regression with maximum likelihood: IF you assume the errors follow a Gaussian distribution then so will Y (the response) given the x-values, and in that case LS is maximum likelihood.
Without the assumption of Gaussian errors LS regression has nothing to do with maximum likelihood.
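To spell out that step, here is a short derivation under the stated assumption that the errors are independent ##N(0, \sigma^2)## (writing ##\sigma^2## for the variance and ##f(x_i; \beta)## for the model mean, notation assumed here). The log-likelihood of the sample is

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - f(x_i; \beta)\right)^2$$

For any fixed ##\sigma^2## the first term does not involve ##\beta##, so maximizing ##\ell## over ##\beta## is the same as minimizing the sum of squared errors. With a different error distribution the sum of squares no longer appears: for Laplace (double-exponential) errors, for example, the log-likelihood involves ##\sum_i |y_i - f(x_i; \beta)|##, and the ML estimate minimizes the sum of absolute errors instead.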

Without the assumption of Gaussian errors LS regression has nothing to do with maximum likelihood.
That is a good point to clarify. Although LS regression is logical in minimizing the sum-squared-errors, the standard calculations of anything related to probabilities are not guaranteed to be correct if the probability assumptions that ##\epsilon## is a normal distribution are not met.

fog37
Thank you. Just to understand correctly:

Different types of regression apply depending on the nature of the response variable ##Y##. For example, in the case of logistic regression, ##Y## is expected to be a Bernoulli random variable. And I thought that in the case of linear regression models, ##Y## was expected to be Gaussian for the model to generate "good" estimates. The GM assumptions, if met, certainly assure that the OLS estimates are reliable...

When ##Y## is not normally distributed, we can still use OLS and find the best-fit line but the estimated coefficients, standard errors, etc. are impacted...

Is that correct?

In the case of logistic regression, the expectation value ##E[Y|X]=p## is transformed into a linear problem by the logit: ##\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 X##. The goal is now to find the best-fit line between the logit and ##X##, correct? But we cannot use OLS (we use MLE) to find the coefficients ##b_0## and ##b_1##. Is it because the logit does not meet the OLS assumptions (homoskedasticity, etc.)?

We can clearly find the best coefficients for a regression line using OLS, MLE and other estimation methods... OLS gives BLUE estimates if the requirements are met...

Thanks

fog37 said:
Thank you. Just to understand correctly:

Different types of regression apply depending on the nature of the response variable ##Y##. For example, in the case of logistic regression, ##Y## is expected to be a Bernoulli random variable. And I thought that in the case of linear regression models, ##Y## was expected to be Gaussian for the model to generate "good" estimates. The GM assumptions, if met, certainly assure that the OLS estimates are reliable...
In logistic regression we don't ##\textbf{expect}## the response to be Bernoulli, it must be [at least for binary logistic regression]. That is the only distributional assumption related to logistic regression. Regarding LS regression, none of the core assumptions require a Gaussian distribution anywhere: the only assumptions on the error term are that the random errors have mean zero and constant variance not related to the x values [that's true for simple and multiple regression]. Least squares can be applied with or without the assumption of normality in the errors.
fog37 said:
When ##Y## is not normally distributed, we can still use OLS and find the best-fit line but the estimated coefficients, standard errors, etc. are impacted...

Is that correct?
Partially -- see my comment above about the lack of need for a normality in linear regression. Some other comments:
* The response, Y, inherits its distribution from the error distribution: ## Y = \beta_0 + \beta_1 x + \varepsilon##: if the error distribution is normal so is the distribution of Y, if the error distribution is not normal then neither is the distribution of Y
* Regardless of the form of the distribution, once the data are collected and LS applied, there is a single set of estimates generated, and the values do not depend in the least on normality or non-normality. IF it is reasonable to assume the errors are normally distributed THEN you know the sampling distributions of the coefficient estimates are normal. But again, if normality is not assumed in the original model it will still be true that, unless you have some odd data, the sampling distributions are approximately normal due to a form of the Central Limit Theorem.
* All of the "good" features of the coefficient estimates (unbiased, for example) are meaningful only if you are sure you've specified the correct model. By that I mean this: Suppose you specify this multiple regression model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
Your LS estimates ##\widehat{\beta_0}, \widehat{\beta_1},\widehat{\beta_2}## will be unbiased for the parameter values in the model you specified. However, if there is some other quantity you hadn't considered, so the actual relationship looks like this
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$
your LS estimates no longer estimate the correct things, because you never included the third variable. In short, your estimates are unbiased, but for the wrong things. [Hint: you never know the correct form for the model you want to use]
* Don't worry about whether any set of data is normally distributed: it isn't. Real data never is. Normality is a theoretical ideal that allows us to say "If our data is quite symmetric and has these certain other properties then we can get very reliable results by treating it as though it comes from this theoretical thing we call a normal distribution"
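The point about estimating "the wrong things" can be illustrated with a simplified sketch: one included predictor and one omitted predictor, rather than the two-plus-one of the example above. The data, the coefficient values, and the helper `fit_simple_ols` are all hypothetical. When the omitted variable is correlated with the included one, the fitted slope equals the true slope plus the omitted coefficient times the slope from regressing the omitted variable on the included one:

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical "true" relationship Y = 1 + 2*x1 + 3*x2, with no noise.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 3.0, 2.0, 5.0, 4.0]  # correlated with x1
ys = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

# Fit the misspecified model Y = b0 + b1 * x1, leaving x2 out.
_, slope_short = fit_simple_ols(x1, ys)

# Slope from regressing the omitted x2 on the included x1.
_, slope_x2_on_x1 = fit_simple_ols(x1, x2)

# Omitted-variable identity: short slope = beta1 + beta2 * slope_x2_on_x1.
predicted_short = 2.0 + 3.0 * slope_x2_on_x1
print(slope_short, predicted_short)
```

The fitted slope on `x1` is well above the true value of 2 because it also absorbs part of the effect of the omitted `x2`.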
fog37 said:
In the case of logistic regression, the expectation value ##E[Y|X]=p## is transformed into a linear problem by the logit: ##\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 X##. The goal is now to find the best-fit line between the logit and ##X##, correct? But we cannot use OLS (we use MLE) to find the coefficients ##b_0## and ##b_1##. Is it because the logit does not meet the OLS assumptions (homoskedasticity, etc.)?

We can clearly find the best coefficients for a regression line using OLS, MLE and other estimation methods... OLS gives BLUE estimates if the requirements are met...

Thanks
In logistic regression LS doesn't work because the function we need to optimize (the likelihood function) isn't amenable to LS. It isn't because of lack of validity of assumptions about variability because those don't apply in logistic regression.
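One way to see the difference: the observed responses are 0 or 1, so ##\operatorname{logit}(y)## is infinite for every observation and cannot be regressed on ##X## directly. The coefficients are instead found by maximizing the Bernoulli log-likelihood, which has no closed-form solution and is done iteratively. Below is a minimal sketch using Newton's method for a single predictor; the function name `fit_logistic_newton`, the iteration count, and the toy data are assumptions for illustration, not how any particular library implements it.

```python
import math


def fit_logistic_newton(xs, ys, iterations=25):
    """Maximize the Bernoulli log-likelihood for logit(p) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient of the log-likelihood (the score).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Negative Hessian (observed information), a 2x2 matrix.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step b <- b + H^{-1} g, using the explicit 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy binary data (not perfectly separable, so the ML estimate is finite).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
b0_hat, b1_hat = fit_logistic_newton(xs, ys)

# At the maximum, both components of the score should be essentially zero.
ps = [1.0 / (1.0 + math.exp(-(b0_hat + b1_hat * x))) for x in xs]
score = (sum(y - p for y, p in zip(ys, ps)),
         sum(x * (y - p) for x, y, p in zip(xs, ys, ps)))
print(b0_hat, b1_hat, score)
```

Each Newton step here amounts to a weighted least-squares solve, which is why the procedure is often described as iteratively reweighted least squares, but the quantity being optimized is the likelihood, not a sum of squared errors.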
Think this [sort of] overview way. Suppose we have a single numerical random quantity, like weight for 12-year old boys, and we want to estimate the mean. We'd like an estimator that shows as little variability as we can get. The sample mean ##\bar x## is used [typically] for a variety of reasons. It is the least squares estimator in this case: it is the number ##a## that minimizes
$$\sum_{i=1}^n \left(x_i - a\right)^2$$
and it is used when we don't have any other information that might help. But suppose we also have the height of each boy: since it's reasonable to assume weight tends to increase with height, it's reasonable to assume that a good description of the mean weight is that it looks like ##\mu = \beta_0 + \beta_1 \cdot \text{height}##, and estimating the mean with least squares leads to the usual LS regression work.
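For completeness, the one-line calculation behind the claim that the sample mean is the least squares estimator (writing ##x_i## for the individual observations). Setting the derivative with respect to ##a## to zero gives

$$\frac{d}{da}\sum_{i=1}^n \left(x_i - a\right)^2 = -2\sum_{i=1}^n \left(x_i - a\right) = 0 \quad\Longrightarrow\quad a = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$

and the second derivative is ##2n > 0##, so this is a minimum.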

fog37
I see. I guess I see the normality requirement so often that I think it is imperative. For example, t-tests and ANOVA seem to also have normality as a requirement... which I interpret as the collected sample needing to have a normal distribution and a good q-q plot to be able to use the t-test on the data...

As far as the logit ##\operatorname{logit}(p) = b_0 +b_1X_1 + b_2 X_2## is concerned, I figured that since it represents a linear relationship between ##\operatorname{logit}(p)## and the predictors ##X_1 , X_2##, and a linear relationship is the business of linear regression, OLS would have to be applied. But as we know, it isn't, and MLE is used instead...

fog37 said:
I see. I guess I see the normality requirement so often that I think it is imperative. For example, t-tests and ANOVA seem to also have normality as a requirement... which I interpret as the collected sample needing to have a normal distribution and a good q-q plot to be able to use the t-test on the data...

As far as the logit ##\operatorname{logit}(p) = b_0 +b_1X_1 + b_2 X_2## is concerned, I figured that since it represents a linear relationship between ##\operatorname{logit}(p)## and the predictors ##X_1 , X_2##, and a linear relationship is the business of linear regression, OLS would have to be applied. But as we know, it isn't, and MLE is used instead...
Personal correction:

A t-test can be applied to a sample which doesn't have a normal histogram as long as the sample is large (n>30). If the sample comes from a normal distribution, even better.
However, if the sample is small (n<30) and it comes from a non-normal distribution, hence the sample histogram is not normal, the t-test is neither valid nor appropriate. A t-test is invalid for small samples from non-normal distributions, and valid for large samples from non-normal distributions.

fog37 said:
Personal correction:

A t-test can be applied to a sample which doesn't have a normal histogram as long as the sample is large (n>30). If the sample comes from a normal distribution, even better.
With caution: that "CLT kicks in when ##n \ge 30##" message is one of the worst ones around. If your histogram (or boxplot, or both) is skewed, relying on any procedure based on the mean and standard deviation is risky, as both of those are highly non-robust.
fog37 said:
However, if the sample is small (n<30) and it comes from a non-normal distribution, hence the sample histogram is not normal, the t-test is neither valid nor appropriate. A t-test is invalid for small samples from non-normal distributions, and valid for large samples from non-normal distributions.
It's best not to think of these ideas as absolute rules. T-tests should be applied with caution, with their pros and cons clearly understood.
And, if you perform hypothesis testing, the same statement applies to thinking about p-values: don't treat the "reject if p < .05, don't reject if p >= .05" line as a commandment. The classic objection is: what if you have two tests for the same thing, with the same sample size, and one sample test gives p = .049, the other p =.051 -- those are essentially the same, so why should the decisions for the two of them be fundamentally different?

In R, I noticed that generalized linear models are fit using the ##glm()## function and specifying the error distribution (Binomial, Poisson, etc.). When fitting a linear regression model, the default is "Gaussian", which means that the assumption is that the error is normal, hence the response variable ##Y## will also be normally distributed. That is why normality is also stuck in my mind. But as we have mentioned, being normal is not a necessary requirement for OLS to give good estimates in the case of linear regression...

Examples:
$$\texttt{lm(y ~ x1 + x2) = glm(y ~ x1 + x2, family=gaussian)}$$
$$\texttt{glm(y ~ x1 + x2, family=gaussian(link="log"))}$$

Yes, the use of lm() and glm() you show first will give the same results for the same data: the reason the "family = gaussian" argument is needed has to do with the inner workings of the glm() function, nothing else.
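For the second call, a hedged note on what the log link changes, in standard GLM notation (the symbols ##\mu## and ##\varepsilon## are introduced here). With family=gaussian(link="log") the model is

$$Y = \mu + \varepsilon, \qquad \ln \mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \qquad \varepsilon \sim N(0, \sigma^2)$$

so the error stays additive on the original scale of ##Y##. This is not the same model as fitting lm(log(y) ~ x1 + x2), which corresponds to

$$\ln Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

that is, a multiplicative error on ##Y##. The two fits will generally give different coefficient estimates for the same data.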

fog37 said:
But as we have mentioned, being normal is not a necessary requirement for OLS to give good estimates in the case of linear regression...
One of my most basic, yet important, lessons in statistics is that "goodness" of an estimator is a multidimensional quantity. An estimator should be consistent; better, asymptotically unbiased; even better, unbiased in finite samples. Furthermore, it should be efficient and robust. Usually you cannot have all of these. There is even a kind of uncertainty relation between bias and efficiency (the Cramér–Rao inequality, https://en.wikipedia.org/wiki/Cramér–Rao_bound).
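For reference, the scalar form of that inequality, under the usual regularity conditions (the symbols ##\theta##, ##\hat{\theta}##, ##b(\theta)## and ##I(\theta)## are notation introduced here). For an unbiased estimator ##\hat{\theta}## of ##\theta##,

$$\operatorname{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}, \qquad I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \ln f(X; \theta)\right)^2\right]$$

and for an estimator with bias ##b(\theta)## the bound becomes

$$\operatorname{Var}(\hat{\theta}) \ge \frac{\left(1 + b'(\theta)\right)^2}{I(\theta)}$$

which is how a biased estimator can have variance below ##1/I(\theta)##: accepting some bias can buy a lower variance.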

FactChecker said:
"Under the usual assumptions, the OLS gives the ML estimator. Minimizing the SSE maximizes the likelihood function, assuming the linear model is correct."
No: OLS is the maximum likelihood estimator only if we assume errors are normally distributed, and the traditional assumptions for regression do not require that.

No: OLS is the maximum likelihood estimator only if we assume errors are normally distributed, and the traditional assumptions for regression do not require that.
The Central limit theorem forces a lot of variables to be approximately normally distributed.
Practically all of the statistical and probability measures associated with a linear regression are based on the assumption of a normally distributed error term.

No: OLS is the maximum likelihood estimator only if we assume errors are normally distributed, and the traditional assumptions for regression do not require that.
If errors are far from normal then you can't make sense of the t-values of the estimators, so they become meaningless. The most common source of non-normality stems from a heteroskedastic Y variable - for example, regressing predictor variables against the S&P 500 index value rather than its log-return.

Another common issue where OLS ≠ ML is serial correlation.

Both of these should be addressed either in the data set by differencing and/or by using a technique like GMM.

FactChecker said:
The Central limit theorem forces a lot of variables to be approximately normally distributed.
Practically all of the statistical and probability measures associated with a linear regression are based on the assumption of a normally distributed error term.
I know about the CLT, but that doesn't apply here: the original statement was that OLS estimates are maximum likelihood, and that's only true in the case of the assumption of the errors being gaussian, and that is not a requirement in the usual assumptions about regression.
