Why is the Maximum Likelihood Function a product?


Discussion Overview

The discussion centers on the Maximum Likelihood Estimation (MLE) function, particularly why it is expressed as a product of probabilities. Participants explore the implications of assuming independence among events, the role of combinations in probability calculations, and the potential consequences of disregarding these combinations in MLE.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant questions the assumption of independence in MLE and whether it is valid to ignore combinations, suggesting that excluding the binomial coefficient could lead to artificially small probabilities.
  • Another participant argues that MLE does not require the assumption of independent and identically distributed (i.i.d.) events, as long as a reasonable joint distribution can be established.
  • A different viewpoint emphasizes that the maximization process in MLE is based on sufficient statistics and does not depend on the distribution of the data in the way the original poster suggests.
  • One participant clarifies that including the binomial coefficient does not change the maximum likelihood estimate for the parameter, as the same value maximizes both the likelihood function with and without the coefficient.
  • Another participant introduces the idea that different descriptions of "the data" could lead to different maximum likelihood estimates, raising questions about what constitutes the data in MLE.

Areas of Agreement / Disagreement

Participants express differing views on the necessity of independence assumptions and the treatment of combinations in MLE calculations. There is no consensus on whether ignoring combinations has practical consequences or if it is acceptable in the context of MLE.

Contextual Notes

Participants highlight that the likelihood function's independence from sample size does not imply that sample size does not influence parameter estimation. The discussion also touches on the ambiguity of what constitutes "the data" in MLE, which may affect the outcomes of estimations.

pluviosilla
Why is the Maximum Likelihood function a product?

Explanations of how the Maximum Likelihood function is constructed usually just mention that events are independent and so the probability of several such events is just the product of the separate probabilities. I get the logic w.r.t. independent events. What I don't get is how we can just assume events are independent, and why we can (as it appears to me in my confusion) disregard combinations in calculating the probability.

For example, suppose I want to know the probability of 4 heads and 7 tails in 11 coin flips. In that case I may NOT exclude the binomial coefficient from my calculation.

Correct: ##P(4, 7) = \binom{N}{k}\cdot p^kq^{N-k} = \binom{11}{4}\cdot p^4q^{7}##

Incorrect: ##P(4, 7) = p^4q^{7}## <<< gives the probability of one rigid sequence of 4 heads and 7 tails

The reasoning, I take it, is that we take the product of the respective probabilities of each bit of data in the sample statistic, then use calculus to determine which population parameters maximize this probability, ignoring combinations because they just constitute a constant coefficient that does not affect the calculation of the maximum.

QUESTION: How can we simply assume events are independent?
QUESTION: Am I right to assume we ignore combinations because they only add a constant coefficient which does not affect the calculation of the maximum?
QUESTION: Are there no practical consequences (problems) arising from dismissing the combinations? With a large sample size we get a probability that is extremely (and artificially) small.
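A quick numeric sketch of the second question (an illustrative check, not part of the original post): grid-search the likelihood of 4 heads in 11 flips with and without the binomial coefficient and compare the maximizers.

```python
from math import comb

# Grid of candidate values for p (endpoints excluded).
p_grid = [i / 1000 for i in range(1, 1000)]

def lik_without(p):
    # Probability of one rigid sequence of 4 heads and 7 tails.
    return p**4 * (1 - p)**7

def lik_with(p):
    # Probability of "4 heads out of 11", with the binomial coefficient.
    return comb(11, 4) * lik_without(p)

p_hat_without = max(p_grid, key=lik_without)
p_hat_with = max(p_grid, key=lik_with)

# Both maximizers land on the same grid point, near 4/11.
print(p_hat_without, p_hat_with)
```

The two likelihood values differ by the constant factor ##\binom{11}{4}##, but the maximizing p is the same either way.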

Examples of MLE discussion include the **marvelous** online lectures in econometrics by Ben Lambert here and here. This guy gives very accessible, very lucid lectures, but I am still puzzled by the aforementioned points.

Disclaimer: this is NOT homework and I am NOT enrolled in anyone's class.
 
pluviosilla said:
Disclaimer: this is NOT homework and I am NOT enrolled in anyone's class.

Even so, you might get a better answer by presenting it as homework. I must admit, I don't really understand what is confusing you. What don't you get about the probability for 11 coin flips? If you add up all the probabilities for 0 heads and 11 tails; 1 head and 10 tails; ...; 11 heads and no tails, you get 1, so how can all these probabilities be too low?
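The point above can be checked directly (a sketch assuming a fair coin, p = 0.5): summing the binomial probabilities over all possible head counts gives 1.

```python
from math import comb

p = 0.5  # assumed fair coin, for illustration
# P(k heads in 11 flips), summed over k = 0..11.
total = sum(comb(11, k) * p**k * (1 - p)**(11 - k) for k in range(12))
print(total)  # 1.0
```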
 
1. You don't have to assume i.i.d. for MLE. You simply need to be able to write down a reasonable joint distribution that is independent of the sample size. Also, the maximization inherently has some steps that need to be accounted for with regard to dependency.
2. Maximizing the parameter has nothing to do with the distribution of the data, at least not in the sense you are talking about. If I have 10 quarters, flip them, and get only one head, the MLE simply says that my estimate for p is 1/10. Similarly, if I flip 10 quarters and get 7 heads, the MLE says my estimate is 7/10. MLE is simply trying to maximize the parameter of a distribution based on some sufficient statistic. It would be unreasonable to use 9/10 if my data showed that I got only 1 head in my trial.
3. The combinations are going to exist for every possible point on the likelihood function. The likelihood function is independent of the sample size. If I flip 1,000,000 quarters and get 700,000 heads, my p parameter is still going to be estimated as 7/10. (When I say independent I don't mean that the sample size doesn't factor into the estimate, but rather that the sample size doesn't change the shape of the likelihood function.)
 
pluviosilla said:
For example, suppose I want to know the probability of 4 heads and 7 tails in 11 coin flips, I may NOT exclude the binomial coefficient in my calculation.

What calculation are you doing?

pluviosilla said:
Correct: ##P(4, 7) = \binom{N}{k}\cdot p^kq^{N-k} = \binom{11}{4}\cdot p^4q^{7}##
Incorrect: ##P(4, 7) = p^4q^{7}## <<< gives probability of a rigid sequence of 4 heads 7 tails

The value of p in [0,1] that maximizes the function ##p^4(1-p)^7## is the same value that maximizes the function ##\binom{11}{4}\cdot p^4(1-p)^7## or the function ##98p^4(1-p)^7## or any other positive constant times ##p^4(1-p)^7##.
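Spelled out with calculus, for any constant ##c > 0##:

##\frac{d}{dp}\left[c\,p^4(1-p)^7\right] = c\,p^3(1-p)^6\left[4(1-p) - 7p\right]##

which vanishes on ##(0,1)## exactly when ##4(1-p) = 7p##, i.e. ##p = 4/11##, regardless of ##c##.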

It is technically correct to include the binomial coefficient in a calculation of the maximum likelihood estimate for p when the given data is "4 heads out of 11". It just happens that including it as a factor doesn't change the answer.

There is substance to your general question. The scenario of maximum likelihood estimation is that we pick a family of distributions defined by parameters and estimate the parameters by selecting those that maximize the likelihood of "the data". This raises the question: What exactly is "the data"?

In an actual experiment of coin flips there would be a specific sequence of heads and tails that would be "the data" if it were recorded. However, the person performing the experiment might give an incomplete description of the result by saying "heads came up 7 out of 11 times". An incompetent technician might even report something like "I remember the first two tosses came up heads and the last one did also, but I've forgotten what happened on the others".

To me, it seems possible to have a situation where the maximum likelihood estimator for one description of "the data" would give different results than the maximum likelihood estimator for a different description of "the data".

Suppose the outcome of 11 independent tosses of a coin is called a "primo" if a prime number of heads comes up and a "non-primo" if a composite number of heads comes up. If "the data" for the experiment is that "The outcome was a primo", we could try to do a maximum likelihood estimate for p. In that situation, I think we would have to consider the binomial coefficients in our calculations.

In your example, it just happens that an incomplete description of the 11 tosses gives the same maximum likelihood estimate for p as a complete description of the sequence of 11 tosses.
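The "primo" idea above can be sketched numerically (a hypothetical illustration, not from the thread): the likelihood of "a prime number of heads in 11 tosses" is a sum of binomial terms, and the coefficients now genuinely enter the maximization.

```python
from math import comb

primes = [2, 3, 5, 7, 11]  # prime head-counts possible in 11 tosses

def lik_primo(p):
    # P(number of heads is prime) = sum of binomial probabilities
    # over the prime head-counts; the coefficients cannot be dropped here.
    return sum(comb(11, k) * p**k * (1 - p)**(11 - k) for k in primes)

p_grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(p_grid, key=lik_primo)
print(p_hat)
```

Interestingly, the grid maximum lands near p = 1 for this description, since an outcome of 11 heads is itself a prime count and becomes certain as p approaches 1.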
 
