# Why is the Maximum Likelihood Function a product?

1. Nov 3, 2016

### pluviosilla

Why is the Maximum Likelihood function a product?

Explanations of how the Maximum Likelihood function is constructed usually just mention that events are independent and so the probability of several such events is just the product of the separate probabilities. I get the logic w.r.t. independent events. What I don't get is how we can just assume events are independent and (what appears to me in my confusion to be) the disregard of combinations in calculating the probability.

For example, suppose I want to know the probability of 4 heads and 7 tails in 11 coin flips. I may NOT omit the binomial coefficient from my calculation.

Correct: $P(4, 7) = \binom{N}{k}\cdot p^kq^{N-k} = \binom{11}{4}\cdot p^4q^{7}$

Incorrect: $P(4, 7) = p^4q^{7}$ <<< this gives the probability of one specific sequence of 4 heads and 7 tails

The reasoning, I take it, is that we take the product of the respective probabilities of each bit of data in the sample statistic, then we use calculus to determine which population parameters maximize this probability, ignoring combinations because they just constitute a constant coefficient that does not affect the calculation of the maximum.

QUESTION: How can we simply assume events are independent?
QUESTION: Am I right to assume we ignore combinations because they only add a constant coefficient which does not affect the calculation of the maximum?
QUESTION: Are there no practical consequences (problems) arising from dismissing the combinations? With a large sample size we get a probability that is extremely (and artificially) small.
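To make the constant-coefficient question concrete, here is a small sketch (my own check, not from any lecture) showing that multiplying the likelihood by $\binom{11}{4}$ leaves the maximizing $p$ unchanged:

```python
# Check that a positive constant factor such as C(11, 4) does not
# move the maximizer of the likelihood for 4 heads in 11 flips.
from math import comb

def argmax_p(f, steps=100_000):
    """Grid-search the maximizer of f over p in (0, 1)."""
    best_p, best_val = None, float("-inf")
    for i in range(1, steps):
        p = i / steps
        v = f(p)
        if v > best_val:
            best_p, best_val = p, v
    return best_p

def plain(p):
    return p**4 * (1 - p)**7          # probability of one specific sequence

def scaled(p):
    return comb(11, 4) * plain(p)     # binomial coefficient included

print(argmax_p(plain))   # ~ 4/11
print(argmax_p(scaled))  # same maximizer
```

Because a positive constant preserves every comparison between likelihood values, both searches return the same grid point, near $4/11$.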

Examples of MLE discussion include the **marvelous** on-line lectures in econometrics by Ben Lambert here and here. This guy gives very accessible, very lucid lectures, but I am still puzzled by the aforementioned points:

Disclaimer: this is NOT homework and I am NOT enrolled in anyone's class.

2. Nov 3, 2016

### PeroK

Even so, you might get a better answer by presenting it as homework. I must admit, I don't really understand what is confusing you. What don't you get about the probability for 11 coin flips? If you add up all the probabilities for 0 heads and 11 tails; 1 head and 10 tails; ...; 11 heads and no tails, you get 1, so how can all these probabilities be too low?

3. Nov 3, 2016

### MarneMath

1. You don't have to assume i.i.d. for MLE. You simply need to be able to write down a reasonable joint distribution that is independent of the sample size. Also, when there is dependency, the maximization inherently involves some extra steps to account for it.
2. Maximizing the parameter has nothing to do with the distribution of the data, at least not in the sense you are talking about. If I have 10 quarters, flip them, and get only one head, the MLE simply states that my maximized estimator for p is 1/10. Similarly, if I have 10 quarters and 7 heads, the MLE simply states my maximized estimator for p is 7/10. The MLE is simply trying to maximize the parameter for a distribution based on some sufficient statistic. It would be unreasonable to use 9/10 if my data showed that I only got 1 head in my trial.
3. The combinations are going to exist for every possible point on the likelihood function. The likelihood function is independent of the sample size. If I have 1,000,000 quarters and get heads 700,000 times, my p parameter is still going to be estimated as 7/10. (When I say independent I don't mean that the sample size doesn't factor into the estimate, but rather that the sample size doesn't change the shape of the likelihood function.)
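The scale-invariance in point 3 is easy to verify numerically. A sketch (my own, not MarneMath's code), working with the log-likelihood $k\ln p + (n-k)\ln(1-p)$ so large samples don't underflow:

```python
# The p maximizing the Bernoulli log-likelihood is k/n, whatever n is.
from math import log

def mle_p(k, n, steps=100_000):
    """Grid-search the p maximizing the log-likelihood for k heads in n flips."""
    best_p, best_ll = None, float("-inf")
    for i in range(1, steps):
        p = i / steps
        ll = k * log(p) + (n - k) * log(1 - p)
        if ll > best_ll:
            best_p, best_ll = p, ll
    return best_p

print(mle_p(7, 10))              # ~ 0.7
print(mle_p(700_000, 1_000_000)) # ~ 0.7 as well
```

Taking logs also answers the "extremely small probability" worry: the raw likelihood of a large sample is astronomically small, but its logarithm is well-behaved and has the same maximizer.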

4. Nov 4, 2016

### Stephen Tashi

What calculation are you doing ?

The value of p in [0,1] that maximizes the function $p^4(1-p)^7$ is the same value that maximizes the function $\binom{11}{4}\cdot p^4(1-p)^7$ or the function $98p^4(1-p)^7$ or any other positive constant times $p^4(1-p)^7$.
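A one-line derivation (my addition, using the log of the likelihood) confirms the shared maximizer:

$$\frac{d}{dp}\left[\ln\binom{11}{4} + 4\ln p + 7\ln(1-p)\right] = \frac{4}{p} - \frac{7}{1-p} = 0 \quad\Longrightarrow\quad p = \frac{4}{11}$$

The constant $\ln\binom{11}{4}$ vanishes under the derivative, which is exactly why the coefficient cannot change the maximizer.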

It is technically correct to include the binomial coefficient in a calculation of the maximum likelihood estimate for p when the given data is "4 heads out of 11". It just happens that including it as a factor doesn't change the answer.

There is substance to your general question. The scenario of maximum likelihood estimation is that we pick a family of distributions defined by parameters and estimate the parameters by selecting those that maximize the likelihood of "the data". This raises the question: What exactly is "the data"?

In an actual experiment of coin flips there would be a specific sequence of heads and tails that would be "the data" if it were recorded. However, the person performing the experiment might give an incomplete description of the result by saying "heads came up 4 out of 11 times". An incompetent technician might even report something like "I remember the first two tosses came up heads and the last one did also, but I've forgotten what happened on the others".

To me, it seems possible to have a situation where the maximum likelihood estimator for one description of "the data" would give different results than the maximum likelihood estimator for a different description of "the data".

Suppose the outcome of 11 independent tosses of a coin is called a "primo" if a prime number of heads comes up and a "non-primo" if a composite number of heads comes up. If "the data" for the experiment is that "the outcome was a primo", we could try to do a maximum likelihood estimate for p. In that situation, I think we would have to keep the binomial coefficients in our calculations.
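The "primo" likelihood can be written down directly: it is a sum of binomial terms over the prime head-counts, so the coefficients genuinely matter. A sketch (the function names are mine, not from the post):

```python
# Likelihood that 11 tosses yield a "primo" (a prime number of heads),
# as a function of p. The binomial coefficients weight the sum and
# cannot be dropped here.
from math import comb

PRIMES = [2, 3, 5, 7, 11]  # prime head-counts possible in 11 tosses

def primo_likelihood(p, n=11):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in PRIMES)

# Grid-search the maximizing p:
best_p = max((i / 1000 for i in range(1, 1000)), key=primo_likelihood)
print(best_p)
```

On this grid the maximizer lands at the edge near $p = 1$ (eleven heads is a prime count, so a coin that always lands heads makes "primo" almost certain), which illustrates the point: a coarse description of "the data" can give a very different estimate than the full sequence would.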

In your example, it just happens that an incomplete description of the 11 tosses gives the same maximum likelihood estimate for p as a complete description of the sequence of 11 tosses.