Why is the maximum likelihood estimation accurate?


Discussion Overview

The discussion revolves around the accuracy and theoretical justification of maximum likelihood estimation (MLE) as a method for estimating parameters in statistical models. Participants explore the conditions under which MLE is considered effective, seek intuitive explanations, and question the validity of MLE in various contexts.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • Some participants express confusion about why maximizing the likelihood function provides a good estimate of the actual parameter, particularly in general cases beyond the normal distribution.
  • Others inquire about the necessary knowledge to formally prove that the maximum of the likelihood function serves as an estimator for the actual parameter.
  • A participant suggests that MLE may be used primarily due to the lack of better alternatives, raising questions about its appropriateness in classical statistics.
  • There are discussions about the relationship between MLE and Bayesian methods, particularly regarding the use of prior distributions and the implications of using maximum a posteriori (MAP) estimates.
  • One participant argues that the quality of the MLE depends on various criteria for what constitutes a "good" estimator, such as unbiasedness and consistency, and notes that MLE can be asymptotically good under certain conditions.
  • Another participant emphasizes that the MLE should be viewed as a "smart guess" rather than a definitive confidence interval for the true parameter value, highlighting the role of luck in its accuracy.
  • Concerns are raised about the validity of MLE when the underlying assumptions about the population are not met, illustrated by an example involving biased coin flips.
  • A later reply attempts to provide an intuitive understanding of MLE, suggesting that the likelihood function reflects the underlying probability density and that choosing the correct parameter maximizes the likelihood of observed samples.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the effectiveness of MLE or its theoretical underpinnings. Multiple competing views are presented regarding its appropriateness, the conditions under which it is effective, and the nature of its estimation capabilities.

Contextual Notes

Limitations in the discussion include the lack of formal proofs for the claims made about MLE, varying interpretations of what constitutes a "good" estimator, and the dependence on specific assumptions about the underlying population distributions.

Avatrin
Hi
I've been googling maximum likelihood estimation. While I do understand how to compute it, I don't understand why maximizing the likelihood function will give us a good estimate of the actual parameter.

In some cases, like the normal distribution, it seems almost obvious. However, in the more general case, I don't know why it is true.

So, I have two questions:
1. How much knowledge do I need to prove that the maximum of the likelihood function is an estimator of the actual parameter?
2. Is there a relatively intuitive explanation for why this method gives us a good estimate of the actual parameter?
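For concreteness, here is a minimal sketch of the normal-distribution case mentioned above, where the likelihood can be maximized in closed form. The data, true parameter values, and random seed are made up for illustration:

```python
import numpy as np

# Made-up data: 1,000 draws from a normal distribution with
# true mean 5.0 and true standard deviation 2.0.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

# For the normal distribution the likelihood can be maximized in closed
# form: the MLE of the mean is the sample mean, and the MLE of the
# variance is the uncorrected (divide-by-n) sample variance.
mu_hat = data.mean()
var_hat = ((data - mu_hat) ** 2).mean()

print(mu_hat, var_hat)   # close to, but not exactly, 5.0 and 4.0
```

The estimates land near the true values but are not exactly equal to them, which is part of what the question is about: why should the maximizer be close at all?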
 
what would you propose to use instead of maximum likelihood?
 
StoneTemplePython said:
what would you propose to use instead of maximum likelihood?
How is that relevant to anything? I am just looking for a formal proof, or, if that doesn't exist, an intuitive explanation for why it should work in the general case.
 
Avatrin said:
How is that relevant to anything? I am just looking for a formal proof, or, if that doesn't exist, an intuitive explanation for why it should work in the general case.

Actually you said

Avatrin said:
So, I have two questions:
1. How much knowledge do I need to prove that the maximum of the likelihood function is an estimator of the actual parameter?
2. Is there a relatively intuitive explanation for why this method gives us a good estimate of the actual parameter?

My question is directly related to prodding you to get an answer to the portion that I bolded (i.e. question number 2).
 
StoneTemplePython said:
Actually you said
My question is directly related to prodding you to get an answer to the portion that I bolded (i.e. question number 2).
Well, I am looking for a relatively intuitive explanation because that makes it easier to get through the formal details afterwards. That's how I got through subjects like topology.

However, if you were just prodding me for an answer... I guess that means this is a case of us using the method because no good alternative exists?
 
Avatrin said:
Well, I am looking for a relatively intuitive explanation because that makes it easier to get through the formal details afterwards. That's how I got through subjects like topology.

However, if you were just prodding me for an answer... I guess that means this is a case of us using the method because no good alternative exists?

I'll give a closely related parallel, which is basically how I think about it. Ignore the classical framework for a moment and consider the Bayesian one.

In Bayesian inference you have a prior distribution and likelihood function(s). Without going into much detail, if you have an improper, uniform prior, then your result (the posterior) just shows the effects of the likelihood function. Your result is an entire distribution -- perhaps a multivariate one -- which is not easy to compress or work with. (Some would say don't compress these distributions, but as you may guess, that can become intractable in large-scale data projects.)

What kind of summary item would you use to describe the entirety of your distribution? Obviously this is a lossy compression.

Typically people use either MAP (Maximum A Posteriori -- i.e. equivalent to maximum likelihood under these special conditions) or LMS (least mean squared error -- i.e. the expected value).

The reality is both of these are (relatively) easy to work with, though you could try to come up with something else I suppose.

In some sense this is a very simple idea: minimize a cost function or choose the most likely 'case'.
- - - -

There are some knotty interpretation issues in classical statistics that make the correct interpretation of results something different than the way most people say. The idea of a posterior distribution doesn't make sense in the classical framework. And the idea of an expected value over said non-existent distributions also doesn't. However the idea of honing in on the most likely 'explanation' does (i.e. ##\text{MAP} \to \text{Max Likelihood}##).

Last I checked there are still significant debates on how appropriate it is to use max likelihood in classical stats. But it is at least something, so people use it.
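The MAP-vs-LMS point above can be sketched numerically. With a flat prior the posterior is just the likelihood rescaled, so the MAP estimate coincides with the maximum likelihood estimate, while LMS gives the posterior mean. The coin-flip numbers and the grid approximation are made up for illustration:

```python
import numpy as np

# Made-up example: 7 heads in 10 coin flips, bias p of the coin unknown.
heads, flips = 7, 10
grid = np.linspace(0.001, 0.999, 999)            # candidate values of p

likelihood = grid**heads * (1 - grid)**(flips - heads)
prior = np.ones_like(grid)                       # flat prior on the grid
posterior = likelihood * prior                   # unnormalized posterior

# With a flat prior the posterior is the likelihood rescaled, so the
# MAP estimate coincides with the maximum likelihood estimate.
p_map = grid[np.argmax(posterior)]
p_mle = grid[np.argmax(likelihood)]
print(p_map, p_mle)                              # both ≈ 0.7 = heads/flips

# The LMS summary is the posterior mean (approximated on the grid);
# it compresses the same distribution to a different single number.
p_lms = np.sum(grid * posterior) / np.sum(posterior)
print(round(p_lms, 3))                           # ≈ 8/12, pulled toward 0.5
```

Both summaries are lossy compressions of the same posterior; which one is "better" depends on the cost function you care about, which is exactly the point being made above.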
 
Avatrin said:
I don't understand why maximizing the likelihood function will give us a good estimate of the actual parameter.

It won't necessarily give you a good estimate.

The intuitive idea of "good" can be translated into a precise mathematical definition in different ways, and different definitions of "good" imply different ways of doing things.

Some examples of different criteria for a "good" estimator are 1) unbiased 2) minimum variance 3) maximum likelihood 4) consistent.

In many situations the maximum likelihood estimator "asymptotically" has all those properties, and the maximum likelihood estimator is conceptually simple since it involves the familiar scenario of trying to find where a function attains a maximum value. That's why one often sees maximum likelihood estimators being used, but whether a maximum likelihood estimator is "good" or not depends on the particulars of a given estimation task.
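One concrete instance of these competing criteria: for a normal distribution, the MLE of the variance (dividing by n) is biased in small samples even though it is consistent, while the divide-by-(n - 1) estimator is unbiased. A simulation sketch with made-up numbers:

```python
import numpy as np

# Made-up setup: repeatedly draw small normal samples (n = 5) with true
# variance 4.0 and compute the MLE of the variance each time.
rng = np.random.default_rng(1)
true_var, n, reps = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))
centered = samples - samples.mean(axis=1, keepdims=True)
var_mle = (centered ** 2).mean(axis=1)           # divide by n, not n - 1

# E[MLE] = (n - 1)/n * true_var = 3.2: biased low in small samples,
# though the bias vanishes as n grows (consistency).
print(var_mle.mean())
```

So the same estimator is simultaneously "bad" by the unbiasedness criterion and "good" by the consistency criterion, which is the point being made above.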
 
Given a sample result, you should look at the MLE as "what is your smartest guess", not as "what is the confidence interval of the true parameter value".

How good the maximum likelihood estimator is depends entirely on unmeasurable luck. Given a result, you are figuring out which population parameter would make that result most likely. If someone flipped a blank coin and told you that the result was heads, you could maximize the likelihood of that result by saying that the coin had heads on both sides. Your saying that doesn't make it true. There is no valid way to assign a probability to the accuracy of the MLE unless you know something about the entire world of all possible populations that this population is one of.

That being said, it is not smart to ignore the maximum likelihood estimator. Sometimes that is all you can do. If you have a lot of data and know enough about the population, then the MLE can be quite good. In the case of the coin toss, getting just one tail on another flip would make the likelihood of a two-headed coin 0.
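The coin example above can be sketched in a few lines; the helper function is hypothetical, written just for this illustration:

```python
# Hypothetical helper: the probability of an observed flip sequence
# given that the coin lands heads with probability p.
def seq_likelihood(p, flips):
    out = 1.0
    for f in flips:
        out *= p if f == "H" else (1.0 - p)
    return out

# After one observed head, p = 1 ("heads on both sides") maximizes
# the likelihood of what was seen...
print(seq_likelihood(1.0, ["H"]))        # 1.0
# ...but a single tail on a later flip drives that likelihood to zero.
print(seq_likelihood(1.0, ["H", "T"]))   # 0.0
print(seq_likelihood(0.5, ["H", "T"]))   # 0.25
```

With one data point the MLE is an extreme, fragile guess; more data rules out the wrong extremes, matching the "quite good with a lot of data" remark above.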
 
FactChecker said:
Given a sample result, you should look at the MLE as "what is your smartest guess", not as "what is the confidence interval of the true parameter value".

How good the maximum likelihood estimator is depends entirely on unmeasurable luck. Given a result, you are figuring out which population parameter would make that result most likely. If someone flipped a blank coin and told you that the result was heads, you could maximize the likelihood of that result by saying that the coin had heads on both sides. Your saying that doesn't make it true. There is no valid way to assign a probability to the accuracy of the MLE unless you know something about the entire world of all possible populations that this population is one of.

That being said, it is not smart to ignore the maximum likelihood estimator. Sometimes that is all you can do. If you have a lot of data and know enough about the population, then the MLE can be quite good. In the case of the coin toss, getting just one tail on another flip would make the likelihood of a two-headed coin 0.
In that case, let me change my question: Why is MLE a smart guess?

Since I originally wrote the OP, I think I have developed an intuitive understanding of why: we expect the sample distribution to reflect the underlying probability density p(x|θ), so there will be many samples where p(x|θ) is high. If we choose the correct parameter, p(x|θ) will be high where there are many samples, and so the likelihood function will be larger than if the parameter is wrong; with a wrong parameter θ₂, many samples will wind up where p(x|θ₂) is lower than if we had used the correct parameter.

However, there must be some theoretical underpinning behind the MLE that can give me a better understanding of it. Also, my reasoning above only works for analytic functions, which, while sufficient for practical applications, cannot give me the understanding I want.
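The intuition described here can be tied to the law of large numbers: the average log-likelihood over the sample approximates the expected log-density under the true distribution, and that expectation is largest at the true parameter (the usual consistency argument, via Gibbs' inequality). A simulation sketch with made-up Bernoulli data:

```python
import numpy as np

# Made-up data: 10,000 Bernoulli draws with true success probability 0.3.
rng = np.random.default_rng(2)
true_p = 0.3
data = rng.random(10_000) < true_p
n, k = len(data), int(data.sum())

# Average log-likelihood of a candidate parameter p over the sample.
# By the law of large numbers this approximates E[log p(x|p)], which
# is maximized at the true parameter (Gibbs' inequality).
def avg_loglik(p):
    return (k * np.log(p) + (n - k) * np.log(1 - p)) / n

candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
best = max(candidates, key=avg_loglik)
print(best)  # 0.3 -- the candidate closest to the truth wins
```

This is the "samples pile up where the density is high" idea made quantitative: with enough data, wrong parameters assign low log-density to most of the sample and lose the maximization.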
 
You may be looking for more "theoretical underpinning" than can be formally proven. Although using the MLE seems smart, any formal proof would require knowledge of all possible statistical populations and the likelihood of each. That is a tall order. It is common in Bayesian techniques to start with the assumption of a uniform probability distribution and to adjust it as data are obtained. But that is a different assumption. Sometimes one has to do what seems smart even if it cannot be formally proven or even formally analysed.
 
Avatrin said:
So, there will be many samples where p(x|p) is high;
If you want to understand the utility of the maximum likelihood estimator intuitively, you should also try to think of situations where it would not be useful.

Consider this example. Let the unknown parameter be C. Let a family of discrete distributions have the densities given by:

Pr(X = C + 1000) = .1
Pr(X = C + k) = .01 for k = 1, 2, ..., 90

If the sample value of X is 4000, the value of the maximum likelihood estimator of C is 3000. However, if C is equal to 3000, there is a probability of 0.9 that the sample value of X will be in the range 3001 to 3090.
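A numerical check of this example; the code just enumerates candidate values of C, purely for illustration:

```python
# The distribution from the post: Pr(X = C + 1000) = 0.1 and
# Pr(X = C + k) = 0.01 for k = 1, ..., 90 (these probabilities sum to 1).
def likelihood(c, x):
    if x - c == 1000:
        return 0.10
    if 1 <= x - c <= 90:
        return 0.01
    return 0.0

x = 4000
candidates = range(x - 1000, x)        # every C that could have produced x
c_mle = max(candidates, key=lambda c: likelihood(c, x))
print(c_mle)                           # 3000: the MLE picks the single 0.1 atom

# Yet if C really were 3000, X would land in 3001..3090 with
# probability 0.9 -- far away from the observed 4000.
prob_band = sum(likelihood(3000, 3000 + k) for k in range(1, 91))
print(prob_band)
```

The MLE latches onto the single most probable atom even though 90% of the probability mass sits elsewhere, which is exactly why the example undermines the naive intuition.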
 
