Improving intuition on applying the likelihood ratio test

TheCanadian
I am trying to better understand the likelihood ratio test and have found a few helpful resources that explicitly solve problems, but I was curious whether you have any more to recommend. Links that work out full problems and also explain the theory nicely would be ideal. Similar links you have found illuminating for the Wald and Lagrange multiplier tests would also be of much interest!
 
Hey TheCanadian.

A likelihood ratio is just one likelihood taken relative to another: the probability of the data under one hypothesis compared with the probability of the data under the other hypothesis.

It is easier to assess the log-likelihood and to understand how the logarithm changes as the probability changes.

You should find that as the probability decreases, the negative of the log-likelihood increases, which means you get a large chi-squared statistic, and this means it is not likely that the model fits, based on the data you have.

Just remember that the likelihood is the probability of getting the observed sample data for a particular value of the parameter [you take sample data and estimate the parameter from that sample data].

The chi-squared distribution is a statistical consequence of the log-likelihood, but the intuition behind interpreting the actual value is that a higher probability corresponds to more support for a hypothesis [at least more evidence in its favor], and you are in essence using the two probabilities to compare [relatively] how much better one hypothesis is with respect to the other.

If the log-likelihood is confusing, then just think about when the different probabilities are greater or less than each other, and how each is either close to zero or close to one.
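
If it helps to make this concrete, here is a minimal sketch in Python of a likelihood ratio test for the mean of a normal sample (the data, the hypothesized mean and the "known" standard deviation are all made-up values, purely for illustration):

```python
import numpy as np
from scipy import stats

# Made-up sample, hypothesized mean, and known standard deviation (illustration only).
x = np.array([9.8, 10.4, 10.1, 9.6, 10.9, 10.3, 9.9, 10.6])
mu0 = 10.0     # mean under the null hypothesis
sigma = 1.0    # standard deviation, assumed known

# Log-likelihood of the sample as a function of the mean.
def loglik(mu):
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

mu_hat = x.mean()  # the MLE of the mean under the alternative

# Likelihood ratio statistic: -2 * log( L(null) / L(unrestricted maximum) ).
lr_stat = -2.0 * (loglik(mu0) - loglik(mu_hat))

# Under the null this is approximately chi-squared with 1 degree of freedom
# (one restricted parameter), so a small p-value is evidence against the null.
p_value = stats.chi2.sf(lr_stat, df=1)
print(lr_stat, p_value)
```

The factor of ##-2## is what turns the log of the ratio into a statistic that is approximately chi-squared under the null hypothesis.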
 
chiro said:
A likelihood ratio is just one likelihood taken relative to another: the probability of the data under one hypothesis compared with the probability of the data under the other hypothesis. [...]

Thank you for the response. I guess my questions largely lie in how one constructs the probability distributions themselves. For example, in the first link, they state that the maximum likelihood estimate of ##\mu## is given by ## L(\overline{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\overline{\sigma}} e^{-\frac{(x_i -\overline{x})^2}{2\overline{\sigma}^2}} ##, but why this is a valid maximum likelihood estimate is not very clear to me.
 
The MLE is the solution of an optimization problem: you find the parameter value for which the probability of the sample data is greatest.

The probability distributions are either estimated from the data, constructed from modeling assumptions, or are standard distributions studied in statistical inference.
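
As a small sketch of that optimization idea (a made-up Poisson example, not one of the problems from your links):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

# Made-up counts, assumed to come from a Poisson distribution with an unknown rate.
counts = np.array([2, 4, 3, 5, 1, 3, 4, 2])

# Negative log-likelihood of the sample as a function of the rate.
def neg_loglik(rate):
    return -np.sum(stats.poisson.logpmf(counts, mu=rate))

# Maximizing the likelihood is the same as minimizing its negative.
result = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")

# The numerical optimum should agree with the closed-form MLE, the sample mean.
print(result.x, counts.mean())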
 
TheCanadian said:
I guess my questions largely lie in how one constructs the probability distributions themselves.

For instance, a typical problem statement of the assumed kind reads: "Previous experience suggests that ##X##, the volume in fluid ounces of a randomly selected jar of the company's honey, is normally distributed with a known variance of 2."
 
Do you know the Central Limit Theorem? It is useful for understanding a lot of statistics based on the normal distribution.

With MLE you start out with a likelihood function that is either derived or simply assumed. The derivation is done from first principles of probability modeling [a good example is a binomial distribution for counts of independent successes, or a Poisson distribution for rates].

You will need to give us more information to assess whether the likelihood is derived from a first-principles approach or is just assumed.
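
For example, the first-principles route for a binomial count of ##k## successes in ##n## independent trials goes like this:
$$L(p) = \binom{n}{k} p^{k}(1-p)^{n-k}, \qquad \frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{k}{n},$$
so the maximum likelihood estimate of the success probability is just the observed proportion of successes.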
 
chiro said:
Do you know the Central Limit Theorem? [...]

I am aware of the Central Limit Theorem. So it appears you assume a model and continue adjusting/adding parameters such that your model matches observations?
 
For MLE you assume a distribution for every sample point, form the likelihood of the whole sample, and then find the values of the parameters [that you are estimating] that maximize it.

It's a lot like maximizing a cost function or some other objective - here you are maximizing the probability of the sample with respect to the parameter you are estimating.

I mention the CLT because, given enough data, many estimators [sample means in particular, and maximum likelihood estimators under mild conditions] are approximately normally distributed, and most large-sample statistics assume this and use the normal distribution for inference.

The likelihood is often chosen by thinking about the process itself and deriving a likelihood function from its attributes. You can instead estimate the distribution directly from the data and update it, but that lacks the grounding of a first-principles approach, where the likelihood is deduced from beliefs and ideas that give context to the data rather than from the data alone.
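
Here is a quick simulation sketch of that normal approximation (the exponential model, the sample size and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeatedly draw samples from an exponential distribution with a known rate
# and record the MLE of the rate (1 / sample mean) for each sample.
true_rate = 2.0
n = 200
mles = np.array([1.0 / rng.exponential(scale=1.0 / true_rate, size=n).mean()
                 for _ in range(5000)])

# For large n the MLE is approximately normal around the true rate,
# with spread close to rate / sqrt(n).
print(mles.mean(), mles.std(), true_rate / np.sqrt(n))
```

For exponential data the MLE of the rate is ##1/\overline{x}##, and for large ##n## its spread is close to the true rate divided by ##\sqrt{n}##, which is what the printout shows.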
 
chiro said:
For MLE you assume a distribution for every sample point, form the likelihood of the whole sample, and then find the values of the parameters [that you are estimating] that maximize it. [...]
How do you approximate anything other than the sampling mean with the CLT?
 
It's important to realize that the word "likelihood" is used because "likelihood" is not the same thing as "probability". When ##f(x)## is a probability density function, its value ##f(a)## at a number ##x = a## is not a probability. The value ##f(a)## is a probability density. That is what "likelihood" means.

TheCanadian said:
For example, in the first link, they state that the maximum likelihood estimate of ##\mu## is given by ## L(\overline{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\overline{\sigma}} e^{-\frac{(x_i -\overline{x})^2}{2\overline{\sigma}^2}} ##, but why this is a valid maximum likelihood estimate is not very clear to me.
Let ##g(y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\overline{\sigma}} e^{-\frac{(x_i - y)^2}{2\overline{\sigma}^2}} ##. What value of ##y## maximizes ##g(y)##? Is it clear that this is the mathematical question? As to why the answer is ##y_{max} = \frac{\sum_{i=1}^n x_i}{n}##, it isn't a conclusion from a general principle of some sort. The answer comes from doing the math to maximize the particular function ##g(y)##. We could try to work out that math, if that is your question.
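
In outline, treating ##\overline{\sigma}## as a known constant:
$$\log g(y) = -n\log\!\left(\sqrt{2\pi}\,\overline{\sigma}\right) - \frac{1}{2\overline{\sigma}^2}\sum_{i=1}^{n}(x_i - y)^2, \qquad \frac{d}{dy}\log g(y) = \frac{1}{\overline{\sigma}^2}\sum_{i=1}^{n}(x_i - y) = 0 \;\Rightarrow\; y_{max} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
The second derivative is ##-n/\overline{\sigma}^2 < 0##, so this stationary point is indeed the maximum.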

Or are you asking why ##g(y)## is the joint probability density for the measured data?

Hypothesis tests are subjective. A subjective line of thinking about the maximum likelihood test is that we should not reject the null hypothesis about a parameter value unless there is a different parameter value that makes the data much more probable. Since "likelihood" doesn't mean "probability", we must be careful in applying this intuition to probability density functions that are multi-modal, or that take on a maximum value at some number ##x = a## and then fall off sharply around ##x = a##. When such things happen, the maximum likelihood at ##x = a## isn't a good representation of the probability that the random variable is approximately equal to ##a##.
 
My understanding is that the likelihood function is the joint density of the sample data as a function of the (unknown) population parameters ##\theta_1, \dots, \theta_k##, i.e., ## L(x_1, x_2, \dots, x_n; \theta_1, \dots, \theta_k) = P(X_1 = x_1, \dots, X_n = x_n \mid \theta_1, \dots, \theta_k) ##, and estimators obtained this way have nice properties, e.g., being asymptotically unbiased and having small variance. I believe that in OLS, if the errors (residuals) are i.i.d. normal with mean ##0## and variance ##\sigma^2##, then the coefficients are the maximum likelihood estimators of the regression line.
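A quick way to see the OLS connection: with i.i.d. normal errors, the log-likelihood of the regression coefficients ##\beta## is
$$\log L(\beta, \sigma^2) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^{T}\beta\right)^2,$$
so for any fixed ##\sigma^2##, maximizing over ##\beta## is the same as minimizing the sum of squared residuals, i.e. the OLS coefficients are the maximum likelihood estimates.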
 