# Probability notation

1. Nov 12, 2016

### npit

There is some confusion regarding notation of the arguments of a probability distribution.
For example, MLE is defined as estimating the distribution that gives the maximum probability of the observations $x_i$ given the distribution parameters, $p(x|\theta)$. However, my instructor stressed that since $\theta$ is not a random variable but a parameter of the distribution, it is meaningless to condition on it, and uses the notation $p(x;\theta)$ instead.

I understand that the word "given" can be misleading since in natural language it can very well mean "given the specific value of the parameter" but in probability it refers to conditional probability.

My question is whether the use of the | notation is valid, or whether the view of $\theta$ as a random variable is illegal.
Isn't MLE, however, in comparison to MAP, viewed as a specific case of MAP where the prior distribution of $\theta$ is uniform? So is the use of the | valid?

2. Nov 12, 2016

### PeroK

The point is that $\theta$ has a single, definite value, which you are trying to estimate. It is not a variable.

Perhaps a crude analogy is the parameters $a, b, c$ in a quadratic expression: $ax^2 + bx + c$. These are not variables.

3. Nov 12, 2016

### Stephen Tashi

That is the underlying motivation for a maximum likelihood estimate, but technically such an estimate maximizes the "likelihood" of the data, not the "probability" of the data. For a discrete random variable, the probability mass function f(.) evaluated at a value v can be interpreted as the "probability of v", but for a continuous random variable, the probability density function f(.) evaluated at a value v is said to give the "likelihood" of v. The "probability" of an outcome of exactly v is zero for any distribution described by a density function.
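
To make the density-versus-probability distinction concrete, here is a minimal Python sketch. The normal distribution with standard deviation 0.1 and the helper names are assumptions chosen for illustration. It shows a density value that exceeds 1 (so it cannot be a probability), while the probability of landing in a small window around v shrinks toward zero with the window.

```python
import math

def normal_pdf(v, mean, sd):
    """Density of a normal distribution evaluated at v (a likelihood value, not a probability)."""
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(v, mean, sd):
    """P(X <= v) for a normal random variable X."""
    return 0.5 * (1.0 + math.erf((v - mean) / (sd * math.sqrt(2.0))))

def prob_within(v, half_width, mean, sd):
    """P(v - half_width <= X <= v + half_width), a genuine probability."""
    return normal_cdf(v + half_width, mean, sd) - normal_cdf(v - half_width, mean, sd)

# Hypothetical distribution: mean 0, standard deviation 0.1.
density_at_mean = normal_pdf(0.0, 0.0, 0.1)

print("density at v = 0:", density_at_mean)          # larger than 1
for h in (1e-1, 1e-2, 1e-3, 0.0):
    print("P(|X| <=", h, ") =", prob_within(0.0, h, 0.0, 0.1))
```

For a narrow window, the probability is roughly the density times the window width, which is why maximizing the density is a sensible stand-in for "making the data most probable".
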
Glancing at the topic on Wikipedia and Wolfram, those authors don't follow the notation used by your instructor. However, your instructor is making a good argument from the viewpoint of "frequentist" statistics. There is a distinct difference in treating a quantity in a statistical problem as "fixed, but unknown" versus "a fixed value that is a realization of a random variable". Frequentist statistics often uses variables representing "fixed, but unknown" quantities.

That could be the distinction between "$\theta$" as a symbol representing a fixed but unspecified value of an ordinary "variable" and "$\theta$" as a symbol for the realization of a "random variable". In both cases $\theta$ can represent a fixed value. The distinction between the two situations is what $\theta$ is a value of. By analogy, in a physics book, "$\theta$" might be used to represent a specific angle of a triangle on one page and on another page it might represent the specific phase angle in a wave function.

A good example of the distinction between "fixed, but unknown" and "realized value of a random variable" is in interpreting "confidence intervals". Suppose you work a typical confidence interval problem and show "There is a 0.90 probability that the population mean is within 23.6 of the sample mean". If the mean of the sample is 130, you may not claim that "There is a 0.90 probability that the population mean is in the interval $[130 - 23.6, 130+23.6]$". The whole procedure of computing the "confidence interval" width is based on assuming the population mean has a "fixed, but unknown" value. So you can't wind up your work by saying something about the probability that the population mean is somewhere. There's no probability about it; it has a fixed but unknown value.

If you want to make claims like "There is a 0.90 probability that the population mean is in the interval $[130 - 23.6, 130+23.6]$", you have to define the problem differently. If you take a Bayesian approach, you assume the population mean has some prior distribution. Then you can compute a "credible interval" and make specific claims about the probability of the population mean being in it.
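
The frequentist reading of the 0.90 can be sketched with a small simulation. This is only an illustration under assumed values: a normal population with known standard deviation 50, samples of size 12, and a true mean of 130 (none of these come from the problem above, and `ci_coverage` is a made-up helper name). The true mean is held fixed; only the sample mean, and hence the interval, varies from trial to trial.

```python
import random
import statistics

def ci_coverage(true_mean=130.0, sd=50.0, n=12, trials=10000, seed=0):
    """Fraction of 90% z-intervals (known sd) that contain the fixed true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.95)  # two-sided 90% interval
    half_width = z * sd / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        # The interval moves with the sample; true_mean never changes.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return half_width, hits / trials

half_width, coverage = ci_coverage()
print(f"half-width = {half_width:.1f}, coverage = {coverage:.3f}")
```

The coverage should come out near 0.90. That number describes the long-run behaviour of the procedure over repeated samples. It says nothing about the one interval computed from a single observed sample, which either contains the true mean or does not.
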

The legality of the view depends on how the problem is defined. As to notation, people use all sorts of ambiguous notation; it's largely a matter of personal preference.

MAP and MLE solve two different problems. There are cases where they produce the same numerical answer. The current Wikipedia article says:

That is a statement that two different problems may have the same numerical answer, not that the definition of one problem is a special case of the other.

For example, if the parameter is the mean of a normal distribution, how are you going to put a "uniform prior" on it? You will have to assume it is in some bounded interval. Setting up an estimation problem as MAP involves making more assumptions than solving an MLE problem.

Maximum a posteriori estimation assumes that the parameter $\theta$ is a realized value of some random variable. Having done that, there are cases where maximizing the posterior density of $\theta$ given the data gives the same answer as ignoring the prior distribution of $\theta$. For example, if the family of distributions for the data has densities $f(x;\theta)$, the density of $\theta$ is given by $g(\theta)$, and the observed data is $x_0$, it might happen that maximizing $f(x_0;\theta)$ as a function of the parameter $\theta$ gives the same answer for $\theta$ as maximizing the product $f(x_0|\theta) g(\theta)$, which is proportional to the posterior density of $\theta$. That would be an example of two different mathematical problems having the same answer.
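
As a worked illustration (the coin model and the Beta prior are assumptions introduced here, not part of the original question), let $x_0$ be $h$ heads in $n$ independent tosses of a coin with head probability $\theta$, and let the prior density be $g(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$ on $[0,1]$. Then

$$\hat\theta_{\text{MLE}} = \arg\max_{\theta} f(x_0;\theta) = \arg\max_{\theta}\, \theta^{h}(1-\theta)^{n-h} = \frac{h}{n},$$

$$\hat\theta_{\text{MAP}} = \arg\max_{\theta} f(x_0|\theta)\, g(\theta) = \arg\max_{\theta}\, \theta^{h+a-1}(1-\theta)^{n-h+b-1} = \frac{h+a-1}{n+a+b-2},$$

where the second formula assumes $h+a>1$ and $n-h+b>1$. With $a=b=1$ the prior is uniform on $[0,1]$ and the two estimates agree numerically, yet the MAP problem still required the extra assumption that $\theta$ is random with a specified distribution.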

What justification would you give for using the MLE method? You can make intuitive arguments like "Well, what value of the parameter should I have picked? Should I have picked one that made the data least likely? ... Huh ?". However, the MLE problem doesn't conclude anything (mathematically) about the probability of its result being close to the "true" value of the parameter being estimated.

4. Nov 30, 2016

### haruspex

Your instructor is too focused on conditional probability. The vertical bar is widely used in set theory to mean "given the following fact". In the case of the parameter theta, the notation is shorthand for "given that the parameter of interest has the value theta".
As for variables versus constants: Q: When is a constant not a constant? A: When it varies. Saying that a symbol in algebra represents a constant just means we will be treating it as constant for present purposes. The equation that is obtained had better work for all values in some defined range, though, or the equation is not useful.
Hmm, but can I say I have employed methods to estimate the mean as 130, and I calculate that there is a 0.9 chance that my estimate is within 23.6 of the true mean?

But all that is more philosophy than mathematics. A bit like interpretations of quantum theory. The MLE issue is more serious. Any half reasonable Bayesian approach is likely to yield a more reliable answer. The old standard: how fair is this coin? I get 53 heads from 100 tosses. MLE gives one answer; Bayes says, get serious, the coin was very likely within a fraction of a percent of fair in the first place, and it still very likely is.
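
A rough sketch of the coin example in Python. The Beta(5000, 5000) prior is purely an assumption standing in for "very likely within a fraction of a percent of fair" (its standard deviation is about 0.005), and the function names are illustrative.

```python
def mle(heads, tosses):
    """Maximum likelihood estimate of the head probability."""
    return heads / tosses

def posterior_mean(heads, tosses, a, b):
    """Mean of the Beta(a + heads, b + tails) posterior under a Beta(a, b) prior."""
    return (heads + a) / (tosses + a + b)

heads, tosses = 53, 100
a = b = 5000.0  # hypothetical prior: tightly concentrated around a fair coin

mle_estimate = mle(heads, tosses)
bayes_estimate = posterior_mean(heads, tosses, a, b)

print("MLE estimate:  ", mle_estimate)
print("Bayes estimate:", bayes_estimate)
```

Under this prior the 100 tosses barely move the estimate away from 0.5, whereas the MLE reports the raw frequency. How far the Bayesian answer moves depends entirely on how concentrated the prior is assumed to be.
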
Beyond that, what I hardly ever see discussed is the question of how the statistical decision will be used. An error in one direction on estimating the statistic might be a lot more unfortunate than an equal magnitude error in the other direction. If you take as null hypothesis that a compound does not cause cancer, an abstract scientific approach might set a 95% confidence level to prove that it does, while a public health officer would do better to go for 5%.

End of rant.

5. Nov 30, 2016

### Stephen Tashi

Within the context of elementary mathematics, I think it's impossible to precisely define the concept of a "variable"! It isn't actually a symbol that is taking a varying value as we look at it on a page of a book. (Maybe we could implement a computer display where the place occupied by a "variable" displayed values that rapidly change as time passes.) In mathematics outside the domain of formal logic, we resort to using "variable" as a term of "common speech" rather than a technical term.

Yes, in the formal definition of a variable (in logic or in computer languages), a "variable" has a "scope" and within that "scope", we do manipulate and reason with that variable as if it has the same value. So perhaps a "constant" is just the specific type of "variable" whose "scope" is "the whole computer" or "all of nature", or "the entire book" or "the same until I tell you otherwise".

You can say that if you use a Bayesian approach. Bayesian methods allow such inferences. However, you shouldn't make such a claim if you are presenting your work as the standard type confidence interval analysis taught in introductory statistics courses because that type of analysis is not a Bayesian approach.

I'm tempted to say that the conventional approach in frequentist analysis is to present the estimated mean of 130 and the 90% confidence interval width of 23.6, and then leave it up to the audience to draw the mathematically mistaken conclusion that there is a 90% chance that the population mean is in [130 - 23.6, 130 + 23.6]. Or we can simply present a graph that shows 130 with an "error bar" around it and let the audience draw a similar wrong conclusion.

The audience's instincts are mathematically wrong, but if they work out well in practice then this is a practical argument in favor of Bayesian methods.

Let's hope not!

If we have a mathematical problem that is well-posed, then it has a definite answer and the answer has a specific interpretation in the context of the problem. There is no mathematical ambiguity about frequentist confidence intervals and their meaning. The place where philosophy meets statistics is in decisions about how a real life situation is modeled. That involves subjective choices. "Confidence intervals" come from one type of mathematical model. "Credible intervals" come from a different type of mathematical model. The choice about which type of model to use is subjective, but the interpretation of the intervals within the chosen model is not subjective.

That's a statistical(!) claim. Presumably it's a claim about the "population of all problems" or the "population of all problems of interest to me".

Ok, but that's introducing a time-varying process into the model ("in the first place" vs "still...is").

I agree. Even though there is a field of mathematics called "optimal statistical decisions", I rarely see it applied. However, from a practical point of view this is understandable. Applying optimal statistical decision theory to a problem requires having more "givens" and in real life, many of the "givens" must come from making more assumptions.

6. Nov 30, 2016

### haruspex

Ok, the point you are making there is that the probability of the data given the hypothesis is not the same as the probability of the hypothesis given the data. That is indeed a crucial flaw in classical hypothesis testing, only rectified by Bayesian methods, as far as I know.
I had thought you were arguing that we cannot discuss the probability of some existent but unknown fact being true. I have heard that, and regard it as having no practical relevance.
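
The gap between the probability of the data given the hypothesis and the probability of the hypothesis given the data can be seen in a small Bayes' rule calculation. The numbers below are invented for illustration, and `posterior` is just a helper name.

```python
def posterior(prior_h, p_data_given_h, p_data_given_not_h):
    """P(H | data) from Bayes' rule for a hypothesis H and its complement."""
    evidence = p_data_given_h * prior_h + p_data_given_not_h * (1.0 - prior_h)
    return p_data_given_h * prior_h / evidence

# Hypothetical numbers: H is rare a priori, and the data are likely under H.
prior_h = 0.01
p_data_given_h = 0.95
p_data_given_not_h = 0.05

p_h_given_data = posterior(prior_h, p_data_given_h, p_data_given_not_h)
print("P(data | H) =", p_data_given_h)
print("P(H | data) =", p_h_given_data)
```

With these assumed inputs the data are very probable under the hypothesis, yet the hypothesis remains improbable after seeing the data, because the prior enters the second quantity and not the first.
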