In summary, the posterior predictive distribution is the distribution we expect for future observations; it is what Bayesian statistics uses to make predictions about data not yet seen.

#### Dale

Confessions of a moderate Bayesian, part 4
Bayesian statistics by and for non-statisticians
Read part 1: How to Get Started with Bayesian Statistics
Read part 2: Frequentist Probability vs Bayesian Probability
Read part 3: How Bayesian Inference Works in the Context of Science
Predictive distributions
A predictive distribution is a distribution that we expect for future observations. In other words, instead of describing the mean or the standard deviation of the data we already have, it describes unobserved data.
Predictive distributions are not unique to Bayesian statistics, but one huge advantage of Bayesian statistics is that the Bayesian posterior...

I am ashamed to say I haven't read any of your 4 blogs on Bayes yet.
However, to whet my appetite, help me if you care to with the following:

I have a population of 1000. I want to test an individual for a certain disease d.
Known: average no. of actual diseased individuals = 1 in 1000. My test is 99% effective in that there is only a 1 in 100 chance that a diseased ind. will register a negative test result. My test is also 99% effective in that it reports a healthy person with a negative test result 99% of the time.

Calling d = diseased, d* = healthy, + a positive test result and - a negative result, the post-test result should be

## p(d/-) = \frac {p(-/d) p(d)} {p(-/d) p(d) + p(d*) p(-/d*)}. ##
I hope that's right.
Now, putting numbers in it, the pre-test numbers are (I think)
## p(-/d) = 0.01 ##
## p(d) = 0.001 ##
## p(d*) = 0.999 ##
## p(-/d*) = 0.99 ##

giving ## p(d/-) = \frac {.01 \times .001} {.01 \times .001 + .999 \times .99} \approx 10^{-5}##.

Problem: how can p(d/-) be less than 0.001? And by 3 orders of magnitude?

Get me past this hurdle so I can start reading your 1st blog!
Thx.

BTW I did p(d/+) and it came out right at about 9%.
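For reference, the arithmetic above can be checked with a short script (the variable names are mine, purely for illustration; the input numbers are those stated in the post):

```python
# Quantities from the problem statement (names are illustrative)
p_d = 0.001               # prior probability of disease, p(d)
p_dstar = 0.999           # prior probability of healthy, p(d*)
p_neg_given_d = 0.01      # false-negative rate, p(-/d)
p_neg_given_dstar = 0.99  # true-negative rate, p(-/d*)

# Bayes' theorem for a negative test result
p_d_given_neg = (p_neg_given_d * p_d) / (
    p_neg_given_d * p_d + p_neg_given_dstar * p_dstar)

# Bayes' theorem for a positive test result
p_pos_given_d = 1 - p_neg_given_d          # sensitivity, 0.99
p_pos_given_dstar = 1 - p_neg_given_dstar  # false-positive rate, 0.01
p_d_given_pos = (p_pos_given_d * p_d) / (
    p_pos_given_d * p_d + p_pos_given_dstar * p_dstar)

print(p_d_given_neg)  # about 1.0e-5
print(p_d_given_pos)  # about 0.09
```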

Dale
rude man said:
Problem: how can p(d/-) be less than 0.001? And by 3 orders of magnitude?

Get me past this hurdle so I can start reading your 1st blog!
Thx.

BTW I did p(d/+) and it came out right at about 9%.

Well, p(d/-) is roughly the probability of two unlikely things happening: that he has the disease, and that the test is wrong.

rude man said:
I hope that's right.
Yes, that formula is indeed correct, and it looks like you did the arithmetic correctly as well. What you are seeing is the correct and expected result.

rude man said:
Problem: how can p(d/-) be less than 0.001?
Remember that p(d) = 0.001 is our prior belief. So without even having the test we were pretty convinced that he didn't have the disease, simply because the disease is rare. When we get a negative result for the test we become even more certain that he doesn't have the disease. So a negative test will always reduce our belief that the disease is present.

It wouldn't make sense to say "Yesterday I was pretty confident that he is not sick since it is a rare disease, but today after the test was negative I am concerned that he is probably sick". Instead what we would say is "Yesterday I was pretty confident that he is not sick since it is a rare disease, and today after the test was negative I am utterly convinced that he is not sick."

As @stevendaryl mentioned, believing that he is sick requires two unlikely things: first, he had to be the rare person who actually has the disease, and second, he has to coincidentally be the rare sick person who gets a false negative. Believing that he is well requires two likely things: first, he had to be in the bulk of the population that is well, and second, he has to be in the bulk of the well population that gets a true negative. Two likely things are much more likely than two unlikely things. So the direction of change, to less than 0.001, is correct.

rude man said:
And by 3 orders of magnitude?
It is actually about 2 orders of magnitude: from ##1 \times 10^{-3}## to ##1 \times 10^{-5}##. As a rough estimate you can look at the Bayes factor for a negative test. The Bayes factor is $$\frac{P(-|d)}{P(-|d*)}=\frac{1-0.99}{0.99}\approx 0.01$$ so we do expect it to change by about two orders of magnitude.
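The odds form of Bayes' theorem makes the two-orders-of-magnitude shift explicit (a small sketch; variable names are mine):

```python
# Odds form of Bayes' theorem for the negative-test case
prior_odds = 0.001 / 0.999        # disease vs. no disease, about 1e-3

# Bayes factor for a negative test: P(-|d) / P(-|d*)
bayes_factor = (1 - 0.99) / 0.99  # about 0.01

# Posterior odds = prior odds * Bayes factor
posterior_odds = prior_odds * bayes_factor

# Convert the odds back to a probability
p_d_given_neg = posterior_odds / (1 + posterior_odds)
print(p_d_given_neg)  # about 1e-5, two orders of magnitude below the 1e-3 prior
```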

You did the math correctly.

To get samples of the posterior predictive distribution we take a sample from the posterior parameter distribution, get a set of parameters, plug those parameters into the model, and generate a predicted sample of the data. The distribution of those predictions is the posterior predictive distribution.

Source https://www.physicsforums.com/insights/posterior-predictive-distributions-in-bayesian-statistics/

That definition of the posterior predictive distribution is ambiguous - perhaps the intent is to use examples to make it precise, but I'd have to learn the computer code to figure it out.

Suppose we have a random variable ##X## whose distribution is given by a function ##F(x,\Theta)## where ##\Theta## is the parameter of ##F##. We make the Bayesian assumption that ##\Theta## has a prior distribution. Based on a set of data ##D## we get a posterior distribution ##g(\theta)## for ##\Theta##. Following the definition given above, we generate batches of data by repeating the following process ##M## times: Generate one value ##\theta_i## of ##\Theta## from the distribution ##g##. Then generate a lot of values ##x_{i,j}, j = 1,2,3,...N## of ##X## from the distribution ##F(x,\theta_i)##.
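For concreteness, that batch process might be sketched as follows, assuming (my choice, purely illustrative) a normal model ##F## with known standard deviation 1 and a normal posterior ##g## for the mean:

```python
import random

random.seed(0)  # reproducible illustration

M, N = 1000, 50                # M parameter draws, N data values per batch
post_mean, post_sd = 2.0, 0.3  # assumed posterior g(theta) for the mean

batches = []
for _ in range(M):
    theta_i = random.gauss(post_mean, post_sd)              # theta_i ~ g
    batch = [random.gauss(theta_i, 1.0) for _ in range(N)]  # x_ij ~ F(x, theta_i)
    batches.append(batch)

# Lumping all M*N values into one histogram gives a mixture whose variance
# exceeds that of any single F(x, theta_i): roughly 1**2 + post_sd**2 = 1.09
all_x = [x for batch in batches for x in batch]
mean_all = sum(all_x) / len(all_x)
var_all = sum((x - mean_all) ** 2 for x in all_x) / (len(all_x) - 1)
print(mean_all, var_all)
```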

The result of this simulation process is a "distribution of distributions". If we lump all the ##x_{i,j}, i = 1,2,..M, j = 1,2,...N## data into one histogram and consider it an approximation for the distribution of a single random variable ##Y## then the probability model for ##Y## does not match the Bayesian assumption we adopted because the Bayesian assumption is that a single value of the parameter ##\Theta## was used when Nature generated the data ##D##.

( Of course, I'm speaking as an "objective" Bayesian. If one thinks of prior and posterior distributions as measuring "degrees of belief" then who is to say whether lumping the simulation data into a single histogram is valid?)

It seems to me that the complete definition of a "predictive distribution" should say that it is a posterior distribution of some parameter, which may be different from the parameter that was assigned a prior distribution.

Each simulation of a batch of data from a distribution ##F(x,\theta_i)## can be used to make point estimates of a different parameter ##\gamma_i## of the distribution ##F(x,\theta_i)##. ( For example, if ##\theta_i## is the (population) mean of the distribution then the sample variance of the data we generated can be used to make a point estimate of the variance ##\sigma^2_i## of ##F(x,\theta_i)##.) The simulation process provides different batches of data, so it provides a histogram of ##\gamma_i, i = 1,2,...M## that estimates the posterior distribution of ##\gamma##. We can make a point estimate of ##\gamma## using the original data ##D## and see where this estimate falls on the histogram of the simulated data for ##\gamma##.
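Under the same illustrative assumptions as before (normal model with known standard deviation 1, an assumed normal posterior for the mean), the per-batch point-estimate idea might look like this, taking ##\gamma## to be the variance:

```python
import random
import statistics

random.seed(1)  # reproducible illustration

M, N = 1000, 50                # M batches, N values per batch
post_mean, post_sd = 2.0, 0.3  # assumed posterior g(theta) for the mean

# Point-estimate a different parameter gamma (here: the variance)
# from each simulated batch
gamma_estimates = []
for _ in range(M):
    theta_i = random.gauss(post_mean, post_sd)              # theta_i ~ g
    batch = [random.gauss(theta_i, 1.0) for _ in range(N)]  # x_ij ~ F(x, theta_i)
    gamma_estimates.append(statistics.variance(batch))      # sample variance

# The histogram of gamma_estimates approximates a distribution for gamma;
# its center should sit near the model variance of 1.0
avg_gamma = sum(gamma_estimates) / M
print(avg_gamma)
```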

Dale
Stephen Tashi said:
That definition of the posterior predictive distribution is ambiguous - perhaps the intent is to use examples to make it precise, but I'd have to learn the computer code to figure it out.
Yes, it is a bit ambiguous, sorry about that. Again, I am not a statistician so I will not be able to be as rigorous as one. This is, as claimed, by and for non-statisticians.

That said, that section is not intended to be even an ambiguous definition of the posterior predictive distribution. It is simply a description of how you can obtain samples of the posterior predictive distribution from samples of the posterior distribution.

Stephen Tashi said:
we generate batches of data by repeating the following process ##M## times: Generate one value ##\theta_i## of ##\Theta## from the distribution ##g##. Then generate a lot of values ##x_{i,j}, j = 1,2,3,...N## of ##X## from the distribution ##F(x,\theta_i)##.
Typically ##N=1##. In principle you could have ##N>1## but that is not generally done. That is why I said "generate a predicted sample". I don't know if there is a specific reason for that, but it is what I have always seen done in the literature and so I have copied that.
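With ##N=1## the batch loop collapses to one predicted value per posterior draw. A sketch under the same illustrative normal assumptions as above:

```python
import random

random.seed(2)  # reproducible illustration

M = 10000                      # number of posterior draws
post_mean, post_sd = 2.0, 0.3  # assumed posterior for the mean

# One predicted observation per posterior draw; the collection y_pred
# is then a sample from the posterior predictive distribution
y_pred = [random.gauss(random.gauss(post_mean, post_sd), 1.0)
          for _ in range(M)]

mean_pred = sum(y_pred) / M
var_pred = sum((y - mean_pred) ** 2 for y in y_pred) / (M - 1)
print(mean_pred, var_pred)  # variance near 1 + 0.3**2, wider than the model alone
```

Note how the predictive variance includes both the model noise and the parameter uncertainty, which is the point about not underestimating uncertainty.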

Stephen Tashi said:
probability model for ##Y## does not match the Bayesian assumption we adopted because the Bayesian assumption is that a single value of the parameter ##\Theta## was used when Nature generated the data ##D##.
I disagree. Even if nature has some underlying process with a single value of the parameter, we don't know what the value of that parameter is. So the posterior predictive distribution is the best prediction we can make of future observations, given our current data. Any single value of the parameters that we select would underestimate our uncertainty in the prediction. Accounting for our uncertainty in the parameters is definitely Bayesian, selecting a single value of the parameters would not be a good Bayesian approach even if we believe that nature has such a single value.

Stephen Tashi said:
Each simulation of a batch of data from a distribution ##F(x,\theta_i)## can be used to make point estimates of a different parameter ##\gamma_i## of the distribution ##F(x,\theta_i)##. ( For example, if ##\theta_i## is the (population) mean of the distribution then the sample variance of the data we generated can be used to make a point estimate of the variance ##\sigma^2_i## of ##F(x,\theta_i)##.) The simulation process provides different batches of data, so it provides a histogram of ##\gamma_i, i = 1,2,...M## that estimates the posterior distribution of ##\gamma##. We can make a point estimate of ##\gamma## using the original data ##D## and see where this estimate falls on the histogram of the simulated data for ##\gamma##.
I haven't seen it done that way, but I don't know why it couldn't be done that way. Each individual batch would systematically understate the uncertainty in the predicted data, but I am not sure that would mean that the resulting histogram of ##\gamma## would similarly have artificially reduced uncertainty. It would be plausible to me that the inter-batch variation would adequately represent our uncertainty in ##\gamma##.

Dale said:
I disagree. Even if nature has some underlying process with a single value of the parameter, we don't know what the value of that parameter is. So the posterior predictive distribution is the best prediction we can make of future observations, given our current data.
It would be interesting to define what "best prediction" means in this case. The various senses of "best" for point estimators are well known (unbiased, minimum variance, maximum likelihood, etc.). But what does it mean for an estimator whose outcome is a distribution to be the best estimator for a distribution?

Stephen Tashi said:
It would be interesting to define what "best prediction" means in this case. The various senses of "best" for point estimators are well known (unbiased, minimum variance, maximum likelihood, etc.). But what does it mean for an estimator whose outcome is a distribution to be the best estimator for a distribution?
I certainly cannot be rigorous, but I think that I adequately demonstrated several very useful features of the posterior predictive distribution. In particular, one feature that I would like from a “best estimator” distribution is that it neither ignore outliers nor overfit them. I was quite excited to find exactly that in my experience applying this method to real-world data. It was one of those things that I didn’t know I wanted until I saw it.

Stephen Tashi