In summary, the posterior predictive distribution describes the future observations implied by a model and its posterior; it is the distribution Bayesian statistics uses to make predictions.
  • #1
Confessions of a moderate Bayesian, part 4
Bayesian statistics by and for non-statisticians
Read part 1: How to Get Started with Bayesian Statistics
Read part 2: Frequentist Probability vs Bayesian Probability
Read part 3: How Bayesian Inference Works in the Context of Science
Predictive distributions
A predictive distribution is the distribution we expect future observations to follow. In other words, instead of describing the mean or the standard deviation of the data we already have, it describes data not yet observed.
Predictive distributions are not unique to Bayesian statistics, but one huge advantage of Bayesian statistics is that the Bayesian posterior...

  • #2
I am ashamed to say I haven't read any of your 4 blogs on Bayes yet.
However, to whet my appetite, help me if you care to with the following:

I have a population of 1000. I want to test an individual for a certain disease d.
Known: the average number of actually diseased individuals is 1 in 1000. My test is 99% effective in that there is only a 1 in 100 chance that a diseased individual will register a negative test result. My test is also 99% effective in that it reports a healthy person with a negative test result 99% of the time.

Calling d = diseased, d* = healthy, + a positive test result and - a negative result, the post-test result should be

## p(d|-) = \frac {p(-|d)\, p(d)} {p(-|d)\, p(d) + p(-|d*)\, p(d*)}. ##
I hope that's right.
Now, putting numbers in, the pre-test numbers are (I think)
## p(-|d) = 0.01 ##
## p(d) = 0.001 ##
## p(d*) = 0.999 ##
## p(-|d*) = 0.99 ##

giving ## p(d|-) = \frac {0.01 \times 0.001} {0.01 \times 0.001 + 0.999 \times 0.99} \approx 10^{-5} ##.

Problem: how can p(d|-) be less than 0.001? And by 3 orders of magnitude?

Get me past this hurdle so I can start reading your 1st blog!
Thx.

BTW I did p(d|+) and it came out right at about 9%.
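The arithmetic above can be checked with a short script (a sketch added for illustration, not part of the original post; the variable names are mine):

```python
# Sketch of the calculation above (illustrative; variable names are mine).
# Numbers from the post: prevalence 1 in 1000, 1% false negatives,
# 99% true negatives.
p_d = 0.001        # p(d): prior probability of disease
p_h = 1 - p_d      # p(d*): prior probability of healthy
p_neg_d = 0.01     # p(-|d): false negative rate
p_neg_h = 0.99     # p(-|d*): true negative rate

# Bayes' theorem: p(d|-) = p(-|d) p(d) / [p(-|d) p(d) + p(-|d*) p(d*)]
p_d_neg = p_neg_d * p_d / (p_neg_d * p_d + p_neg_h * p_h)

# Same formula with the complementary rates gives p(d|+)
p_pos_d = 1 - p_neg_d   # p(+|d)
p_pos_h = 1 - p_neg_h   # p(+|d*)
p_d_pos = p_pos_d * p_d / (p_pos_d * p_d + p_pos_h * p_h)

print(p_d_neg)  # about 1e-5
print(p_d_pos)  # about 0.09, i.e. roughly 9%
```

Both numbers match the values quoted in the post.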
 
  • #3
rude man said:
Problem: how can p(d|-) be less than 0.001? And by 3 orders of magnitude?

Get me past this hurdle so I can start reading your 1st blog!
Thx.

BTW I did p(d|+) and it came out right at about 9%.

Well, p(d|-) is roughly the probability of two unlikely things happening: that he has the disease, and that the test is wrong.
 
  • #4
rude man said:
I hope that's right.
Yes, that formula is indeed correct, and it looks like you did the arithmetic correctly as well. What you are seeing is the correct and expected result.

rude man said:
Problem: how can p(d|-) be less than 0.001?
Remember that ##p(d) = 0.001## is our prior belief. So even without the test we were pretty convinced that he didn't have the disease, simply because the disease is rare. When we get a negative result for the test we become even more certain that he doesn't have the disease. So a negative test will always reduce our belief that the disease is present.

It wouldn't make sense to say "Yesterday I was pretty confident that he is not sick since it is a rare disease, but today after the test was negative I am concerned that he is probably sick". Instead what we would say is "Yesterday I was pretty confident that he is not sick since it is a rare disease, and today after the test was negative I am utterly convinced that he is not sick."

As @stevendaryl mentioned believing that he is sick requires two unlikely things, first he had to be the poor rare person that actually has the disease, and second he has to coincidentally be the rare sick person that gets a false negative. Believing that he is well requires two likely things, first he had to be in the bulk of the population that is well and second he has to be in the bulk of the well population that gets a true negative. Two likely things are much more likely than two unlikely things. So the direction of change, to less than 0.001 is correct.

rude man said:
And by 3 orders of magnitude?
It is actually about 2 orders of magnitude: from ##10^{-3}## to about ##10^{-5}##. As a rough estimate you can look at the Bayes factor for a negative test. The Bayes factor is $$\frac{P(-|d)}{P(-|d*)}=\frac{1-0.99}{0.99}\approx 0.01$$ so we do expect it to change by about two orders of magnitude.

You did the math correctly.
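A quick way to see the two-orders-of-magnitude shift is the odds form of Bayes' theorem; here is a small sketch (added for illustration, not from the original posts) using the same numbers:

```python
# Odds form of Bayes' theorem: posterior odds = Bayes factor * prior odds.
# Numbers from the thread (illustrative sketch; variable names are mine).
prior_odds = 0.001 / 0.999              # odds of disease before testing
bayes_factor_neg = 0.01 / 0.99          # P(-|d) / P(-|d*), roughly 0.01
post_odds = bayes_factor_neg * prior_odds
p_d_neg = post_odds / (1 + post_odds)   # convert odds back to a probability
print(p_d_neg)  # about 1e-5: two orders of magnitude below the prior 1e-3
```

This reproduces the exact value from the full Bayes' theorem calculation, since the odds form is algebraically equivalent.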
 
  • #5
To get samples of the posterior predictive distribution we take a sample from the posterior parameter distribution, get a set of parameters, plug those parameters into the model, and generate a predicted sample of the data. The distribution of those predictions is the posterior predictive distribution.

Source https://www.physicsforums.com/insights/posterior-predictive-distributions-in-bayesian-statistics/

That definition of the posterior predictive distribution is ambiguous - perhaps the intent is to use examples to make it precise, but I'd have to learn the computer code to figure it out.

Suppose we have a random variable ##X## whose distribution is given by a function ##F(x,\Theta)## where ##\Theta## is the parameter of ##F##. We make the Bayesian assumption that ##\Theta## has a prior distribution. Based on a set of data ##D## we get a posterior distribution ##g(\theta)## for ##\Theta##. Following the definition given above, we generate batches of data by repeating the following process ##M## times: Generate one value ##\theta_i## of ##\Theta## from the distribution ##g##. Then generate a lot of values ##x_{i,j}, j = 1,2,3,...N## of ##X## from the distribution ##F(x,\theta_i)##.

The result of this simulation process is a "distribution of distributions". If we lump all the ##x_{i,j},\ i = 1,2,...,M,\ j = 1,2,...,N## data into one histogram and consider it an approximation for the distribution of a single random variable ##Y##, then the probability model for ##Y## does not match the Bayesian assumption we adopted, because the Bayesian assumption is that a single value of the parameter ##\Theta## was used when Nature generated the data ##D##.

(Of course, I'm speaking as an "objective" Bayesian. If one thinks of prior and posterior distributions as measuring "degrees of belief", then who is to say whether lumping the simulation data into a single histogram is valid?)

It seems to me that the complete definition of a "predictive distribution" should say that it is a posterior distribution of some parameter, which may be different from the parameter that was assigned a prior distribution.

Each simulation of a batch of data from a distribution ##F(x,\theta_i)## can be used to make point estimates of a different parameter ##\gamma_i## of the distribution ##F(x,\theta_i)##. (For example, if ##\theta_i## is the (population) mean of the distribution then the sample variance of the data we generated can be used to make a point estimate of the variance ##\sigma^2_i## of ##F(x,\theta_i)##.) The simulation process provides different batches of data, so it provides a histogram of ##\gamma_i, i = 1,2,...,M## that estimates the posterior distribution of ##\gamma##. We can make a point estimate of ##\gamma## using the original data ##D## and see where this estimate falls on the histogram of the simulated data for ##\gamma##.
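The batch procedure described here resembles what is often called a posterior predictive check. A minimal sketch, assuming a hypothetical normal model with known variance and a flat prior on the mean (none of these modeling choices come from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model (my choice, not from the thread): X ~ Normal(theta, sigma)
# with sigma known and a flat prior on theta, so the posterior g for theta
# given data D is Normal(mean(D), sigma / sqrt(len(D))).
sigma = 1.0
D = rng.normal(0.0, sigma, size=50)            # the observed data
post_mean, post_sd = D.mean(), sigma / np.sqrt(len(D))

M, N = 2000, 50                                 # M batches of N points each
gamma = np.empty(M)                             # per-batch point estimates
for i in range(M):
    theta_i = rng.normal(post_mean, post_sd)    # one draw from the posterior g
    batch = rng.normal(theta_i, sigma, size=N)  # a batch from F(x, theta_i)
    gamma[i] = batch.var(ddof=1)                # sample variance as gamma_i

# Compare the observed sample variance to the histogram of simulated gamma_i
obs_var = D.var(ddof=1)
p_value = (gamma >= obs_var).mean()             # a posterior predictive p-value
```

Seeing where `obs_var` falls within the histogram of `gamma` is exactly the "see where this estimate falls" step above.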
 
  • #6
Stephen Tashi said:
That definition of the posterior predictive distribution is ambiguous - perhaps the intent is to use examples to make it precise, but I'd have to learn the computer code to figure it out.
Yes, it is a bit ambiguous, sorry about that. Again, I am not a statistician so I will not be able to be as rigorous as one. This is, as claimed, by and for non-statisticians.

That said, that section is not intended to be even an ambiguous definition of the posterior predictive distribution. It is simply a description of how you can obtain samples of the posterior predictive distribution from samples of the posterior distribution.

Stephen Tashi said:
we generate batches of data by repeating the following process ##M## times: Generate one value ##\theta_i## of ##\Theta## from the distribution ##g##. Then generate a lot of values ##x_{i,j}, j = 1,2,3,...,N## of ##X## from the distribution ##F(x,\theta_i)##.
Typically ##N=1##. In principle you could have ##N>1## but that is not generally done. That is why I said "generate a predicted sample". I don't know if there is a specific reason for that, but it is what I have always seen done in the literature and so I have copied that.
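The ##N=1## scheme might look like this for a hypothetical normal model with known variance and a flat prior on the mean (an illustration of the idea, not the article's actual code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model (my choice, not from the article): X ~ Normal(theta, sigma)
# with sigma known and a flat prior on theta, so the posterior for theta is
# Normal(mean(D), sigma / sqrt(len(D))).
sigma = 1.0
D = rng.normal(0.0, sigma, size=50)   # the observed data
post_mean, post_sd = D.mean(), sigma / np.sqrt(len(D))

# N = 1: one predicted observation per posterior draw.
theta = rng.normal(post_mean, post_sd, size=10_000)  # posterior samples
y_pred = rng.normal(theta, sigma)                    # posterior predictive samples

# The predictive variance exceeds sigma**2 because parameter uncertainty
# is folded in: Var(y_pred) is roughly sigma**2 + post_sd**2.
print(y_pred.var())
```

Note that `y_pred` is spread more widely than `sigma` alone would suggest, since any single value of the parameter would underestimate the uncertainty in the prediction.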

Stephen Tashi said:
probability model for ##Y## does not match the Bayesian assumption we adopted because the Bayesian assumption is that a single value of the parameter ##\Theta## was used when Nature generated the data ##D##.
I disagree. Even if nature has some underlying process with a single value of the parameter, we don't know what the value of that parameter is. So the posterior predictive distribution is the best prediction we can make of future observations, given our current data. Any single value of the parameters that we select would underestimate our uncertainty in the prediction. Accounting for our uncertainty in the parameters is definitely Bayesian, selecting a single value of the parameters would not be a good Bayesian approach even if we believe that nature has such a single value.

Stephen Tashi said:
Each simulation of a batch of data from a distribution ##F(x,\theta_i)## can be used to make point estimates of a different parameter ##\gamma_i## of the distribution ##F(x,\theta_i)##. (For example, if ##\theta_i## is the (population) mean of the distribution then the sample variance of the data we generated can be used to make a point estimate of the variance ##\sigma^2_i## of ##F(x,\theta_i)##.) The simulation process provides different batches of data, so it provides a histogram of ##\gamma_i, i = 1,2,...,M## that estimates the posterior distribution of ##\gamma##. We can make a point estimate of ##\gamma## using the original data ##D## and see where this estimate falls on the histogram of the simulated data for ##\gamma##.
I haven't seen it done that way, but I don't know why it couldn't be done that way. Each individual batch would systematically understate the uncertainty in the predicted data, but I am not sure that would mean that the resulting histogram of ##\gamma## would similarly have artificially reduced uncertainty. It would be plausible to me that the inter-batch variation would adequately represent our uncertainty in ##\gamma##.
 
  • #7
Dale said:
I disagree. Even if nature has some underlying process with a single value of the parameter, we don't know what the value of that parameter is. So the posterior predictive distribution is the best prediction we can make of future observations, given our current data.
It would be interesting to define what "best prediction" means in this case. The various senses of "best" for point estimators are well known (unbiased, minimum variance, maximum likelihood, etc.). But what does it mean for an estimator whose outcome is a distribution to be the best estimator for a distribution?
 
  • #8
Stephen Tashi said:
It would be interesting to define what "best prediction" means in this case. The various senses of "best" for point estimators are well known (unbiased, minimum variance, maximum likelihood, etc.). But what does it mean for an estimator whose outcome is a distribution to be the best estimator for a distribution?
I certainly cannot be rigorous, but I think that I adequately demonstrated several very useful features of the posterior predictive distribution. In particular, one feature that I would like from a “best estimator” distribution is that it neither ignore outliers nor overfit them. I was quite excited to find exactly that in my experience applying this method to real-world data. It was one of those things that I didn’t know I wanted until I saw it.
 

1. What is a posterior predictive distribution?

A posterior predictive distribution is a probability distribution that represents the uncertainty of future observations given the data and model parameters in a Bayesian framework. It takes into account both the prior beliefs and the observed data to make predictions about future outcomes.

2. How is a posterior predictive distribution calculated?

A posterior predictive distribution is calculated by integrating the likelihood function with respect to the posterior distribution of the model parameters. This can be done analytically for simple models, but for more complex models, numerical methods such as Markov chain Monte Carlo (MCMC) are used.
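In symbols (notation added here for illustration), with observed data ##D##, parameters ##\theta##, and a future observation ##\tilde{y}##, the integral is
$$p(\tilde{y} \mid D) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid D)\, d\theta$$
where ##p(\theta \mid D)## is the posterior distribution and ##p(\tilde{y} \mid \theta)## is the likelihood evaluated at the future observation.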

3. What is the difference between a prior predictive distribution and a posterior predictive distribution?

A prior predictive distribution represents the uncertainty of future observations based on the prior beliefs about the model parameters, without taking into account any observed data. In contrast, a posterior predictive distribution incorporates both the prior beliefs and the observed data to make predictions about future outcomes.

4. How can posterior predictive distributions be used in decision making?

Posterior predictive distributions can be used to make decisions by comparing the predictive distributions of different models and choosing the one that best fits the observed data. They can also be used to assess the uncertainty of future outcomes and inform decision making under uncertainty.

5. What are the advantages of using posterior predictive distributions in Bayesian statistics?

Posterior predictive distributions allow for the incorporation of prior beliefs and observed data in making predictions, which can lead to more accurate and informative results compared to traditional frequentist methods. They also provide a measure of uncertainty in the predictions, which is important for decision making and risk assessment.
