I Questions about error range from Bayesian statistics

  • Thread starter Buzz Bloom

Buzz Bloom

Gold Member
2,035
330
Summary
In several threads I have seen discussions of the difference between using Bayesian and frequentist statistics. In particular, the use of Bayesian statistics in cosmology and frequentist statistics in QM. I have come to understand that Bayesian statistics generally do not produce any error range and confidence level, in contrast with those results in frequentist statistics based on the calculation of a standard deviation. I have some questions about this.
About cosmology:
There are two main types of statistics used in physics: frequentist and Bayesian. The one most commonly used in cosmology is Bayesian statistics, which Planck also states that they are using at the beginning of page 4.
Planck paper:​
As in PCP13 we adopt a Bayesian framework for testing theoretical models.​
In the Planck paper, pages 38-40, equations 45 through 52, there are values given for parameters with error ranges and confidence levels. Here is an example.
ΩK = 0.000 ± 0.005 (95%, Planck TT+lowP+lensing+BAO). (50)

About QM:
Post #17​
These are not only philosophical points of view, but from these interpretations come different methods of inferences.​
For example, the debate that took place during the discovery of the Higgs Bosons is about the difference between frequentist and Bayesian inferences: http://tonyohagan.co.uk/academic/pdf/HiggsBoson.pdf

I would much appreciate it if someone would post a simple example of how Bayesian statistics calculates an error range and confidence level. Before I retired I had some occasions to use frequentist statistics to create models for data mining, so I am familiar with how to calculate an error range and confidence level using frequentist statistics, but I have no experience or familiarity with using Bayesian statistics for this purpose.

Regards,
Buzz
 

Stephen Tashi

Science Advisor
6,624
1,020
I would much appreciate it if someone would post a simple example of how Bayesian statistics calculates an error range and confidence level.
Before you consider examples, you should consider definitions and completely understand what "confidence interval" and "error range" mean in frequentist statistics. They have a different meaning in Bayesian statistics. The fact you know how to calculate confidence intervals in frequentist statistics doesn't necessarily prepare you for understanding the definition of what you are calculating.
 

Buzz Bloom

Gold Member
2,035
330
Before you consider examples, you should consider definitions and completely understand what "confidence interval" and "error range" mean in frequentist statistics. They have a different meaning in Bayesian statistics. The fact you know how to calculate confidence intervals in frequentist statistics doesn't necessarily prepare you for understanding the definition of what you are calculating.
Hi Stephen:

I thank you for educating me that the meaning of the terms is different in Bayesian statistics than in frequentist statistics. I believe I do understand the meanings of "confidence interval" and "error range" in frequentist statistics. Take the example I gave quoting Equation 50, even though that example related to cosmology. The "95%" confidence level corresponds to approximately a 2σ value assuming a Gaussian distribution. The error function
err(2) = 0.4773.​
That means
2×0.4773 = 0.9546 ~= 95%.​
This means that the +/-0.005 corresponds to
σ ~= 0.0025.​
This seems to me to be the same meaning it has in frequentist statistics. However, I could be mistaken.
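
As a quick check of the arithmetic above, here is a minimal sketch assuming a Gaussian distribution. Note that 0.4773 is the area of the standard normal between 0 and 2 (that is, Φ(2) − 1/2), not the value of erf(2) itself. The variable names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area of the standard normal between 0 and 2 (the table value quoted above).
half_area = std_normal.cdf(2.0) - 0.5

# Probability of falling within +/- 2 sigma of the mean.
two_sided = 2.0 * half_area

# Multiplier z for which +/- z sigma contains exactly 95%.
z95 = std_normal.inv_cdf(0.975)

# Sigma implied by a +/-0.005 half-width under each reading.
sigma_if_2sigma = 0.005 / 2.0
sigma_if_exact95 = 0.005 / z95

print(f"half area = {half_area:.4f}, within 2 sigma = {two_sided:.4f}")
print(f"z for 95% = {z95:.3f}")
print(f"sigma: {sigma_if_2sigma:.5f} (2 sigma) vs {sigma_if_exact95:.5f} (exact 95%)")
```

Either reading gives σ of roughly 0.0025, so the estimate in the post holds up under the Gaussian assumption.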

The articles
and
(if I correctly understood what I read) are consistent with my example description above. The article
did not use the words "confidence" or "error", so this leads me to assume that the concepts of "confidence interval" and "error range" are NOT part of the Bayesian methods. However, the use of Bayesian methods in the Planck article produced values corresponding to these concepts.

The question for which I am seeking an example is:
How can the values of these two concepts get calculated if the method used is Bayesian?​

If the meaning of these concepts is different in Bayesian and frequentist methods, I do not know where to look for an explanation of that. Can you help me with that?

Regards,
Buzz
 

Orodruin

Staff Emeritus
Science Advisor
Homework Helper
Insights Author
Gold Member
2018 Award
15,633
5,658
Cosmologists frequently abuse nomenclature using ”confidence region” or ”confidence level” when they are actually referring to ”credible region” or ”integrated probability”. If you are not familiar with the particular use in the field it is easy to be led astray.
 

Stephen Tashi

Science Advisor
6,624
1,020
This seems to me to be the same meaning it has in frequentist statistics.
You've given an example of calculating some numbers without stating any interpretation for what those numbers represent. What does a "95% confidence interval" mean to you?
 

Buzz Bloom

Gold Member
2,035
330
Cosmologists frequently abuse nomenclature using ”confidence region” or ”confidence level” when they are actually referring to ”credible region” or ”integrated probability”. If you are not familiar with the particular use in the field it is easy to be led astray.
Hi Orodruin:

"Integrated probability" seems similar to "Gaussian error function", except that the distribution, as I understand it, is not Gaussian, and may not even be symmetric. Aside from that, can you please explain in what way the concept is different than the one I described in my post #3?

Regards,
Buzz
 

Buzz Bloom

Gold Member
2,035
330
You've given an example of calculating some numbers without stating any interpretation for what those numbers represent. What does a "95% confidence interval" mean to you?
Hi Stephen:

I am familiar with two ways to express this. Both ways are OK as I understand the concepts. I am using the same example of a mean value of zero for Ωk with a +/- range of 0.005 as described in my post #3.
1. The data from which the probabilities are calculated are consistent with a 95% certainty that the parameter value of interest, Ωk, satisfies
-0.005 < Ωk < +0.005 .​
2. I have reason to believe that
-0.005 < Ωk < +0.005 .​
is a true statement with a 95% confidence that this belief is correct.

If you prefer a different way to express the concept, please post it.

Regards,
Buzz
 

FactChecker

Science Advisor
Gold Member
2018 Award
4,877
1,674
1. The data from which the probabilities are calculated are consistent with a 95% certainty that the
2. I have reason to believe that
-0.005 < Ωk < +0.005 .​
is a true statement with a 95% confidence that this belief is correct.
The phrases "95% certainty" and "95% confidence" are intentionally vague, because any probabilities depend on parameter assumptions that are undeniably incorrect (the assumed parameter values are almost certainly inexact and may be very wrong). It is left up to the user to apply the results judiciously.
 

Buzz Bloom

Gold Member
2,035
330
The phrases "95% certainty" and "95% confidence" are intentionally vague, because any probabilities depend on parameter assumptions that are undeniably incorrect (the assumed parameter values are almost certainly inexact and may be very wrong). It is left up to the user to apply the results judiciously.
Hi FactChecker:

Thank you for your explanation. I should have taken the concept of conditional probability into account.
Let me try again.

The two renditions of the concept of "confidence interval" in my post #7 are based on frequentist statistics concepts. For Bayesian statistics I need to include the concept of conditional probability.

1. If I assume that certain conditional probability priors are reasonable choices, then the data from which the probabilities are calculated are consistent with a 95% certainty that the parameter value of interest, Ωk, satisfies
-0.005 < Ωk < +0.005 .​
2. If I assume that certain conditional probability priors are reasonable choices, then I have reason to believe that
-0.005 < Ωk < +0.005 .​
is a true statement with a 95% confidence that this belief is correct.

Is that better? If so, I would really appreciate seeing a simple example of how Bayesian statistics can be used to calculate values of mean and standard deviation for a single parameter based on a collection of conditional probabilities and priors.

Regards,
Buzz
 

FactChecker

Science Advisor
Gold Member
2018 Award
4,877
1,674
As long as it is clear that both approaches require a realistic understanding of the assumptions involved and are used judiciously. Neither can give a mathematically rigorous answer in the form of a probability unless some bold assumptions are made.
 
Dale

27,837
4,291
I would much appreciate it if someone would post a simple example of how Bayesian statistics calculates an error range and confidence level.
Probably the simplest example is for estimating the probability of success for a series of Bernoulli trials (e.g. estimating how fair a coin is by flipping it a lot).

For the Bayesian approach it is pretty straightforward. If you have ##n## trials with ##s## successes and ##f=n-s## failures then you describe the rate of success as a random variable ##p\sim Beta(s+1,f+1)##. Since you have a probability distribution for the success rate ##p## then you can do everything that you can do with probability distributions, including finding the mean, estimating quantiles, and calculating interval probabilities.

For example if you do 30 trials and get 12 successes and 18 failures, then you would have ##p\sim Beta(13,19)## which is plotted below. The mean of this distribution is 0.41 and the 0.025 quantile is 0.25 and the 0.975 quantile is 0.58. This means that there is a 0.95 probability that the success rate is between 0.25 and 0.58.
beta.png
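
A minimal sketch that reproduces these numbers using only the Python standard library. The helper names `beta_cdf_int` and `beta_quantile_int` are made up for this illustration. They rely on the identity that, for integer parameters, the Beta(a, b) CDF at x equals the probability that a Binomial(a+b−1, x) variable is at least a.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x, valid for positive integer a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile_int(q, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


s, f = 12, 18            # successes and failures from the example
a, b = s + 1, f + 1      # parameters of Beta(13, 19)

post_mean = a / (a + b)
q_low = beta_quantile_int(0.025, a, b)
q_high = beta_quantile_int(0.975, a, b)

print(f"mean = {post_mean:.2f}")
print(f"0.025 quantile = {q_low:.2f}, 0.975 quantile = {q_high:.2f}")
```

The interval between the two quantiles is the central 95% interval of the distribution. Other choices, such as the shortest interval containing 95% of the probability, are also possible and give slightly different endpoints for a skewed distribution.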


For the frequentist approach it is less straightforward. You can find an estimator ##\hat p = s/n## of the unknown true probability ##p## which is not treated as a random variable. Because ##p## is not a random variable you cannot take its mean, but you can assume values for ##p## and use the Binomial distribution to predict how likely you are to see ##s## successes in ##n## trials and therefore you can get a likelihood function for ##\hat p##. When you do that then you can estimate the probability of getting a ##\hat p## value that is more extreme than the observed value under any given hypothesis, such as the hypothesis that ##p=0.5##. A 95% confidence interval is then an interval that should contain 95% of the estimates of ##\hat p## under that assumption if you repeated the exact experiment many many times.

For the above example, ##\hat p = 0.40## and the 95% confidence interval is ##[0.22,0.58]##, which is pretty similar to the Bayesian estimate.
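
The post does not say which frequentist interval was used. As a sketch only, the normal-approximation (Wald) interval happens to reproduce the quoted endpoints. Other standard choices, such as the Wilson or Clopper-Pearson intervals, give somewhat different numbers.

```python
from math import sqrt
from statistics import NormalDist

n, s = 30, 12                      # trials and successes from the example
p_hat = s / n                      # point estimate of the success rate

z = NormalDist().inv_cdf(0.975)    # about 1.96 for a two-sided 95% interval
std_err = sqrt(p_hat * (1.0 - p_hat) / n)

ci_low = p_hat - z * std_err
ci_high = p_hat + z * std_err

print(f"p_hat = {p_hat:.2f}, 95% Wald interval = [{ci_low:.2f}, {ci_high:.2f}]")
```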
 

Stephen Tashi

Science Advisor
6,624
1,020
1. The data from which the probabilities are calculated are consistent with a 95% certainty that the parameter value of interest, Ωk, satisfies
-0.005 < Ωk < +0.005 .​
2. I have reason to believe that
-0.005 < Ωk < +0.005 .​
is a true statement with a 95% confidence that this belief is correct.

The language of mathematical probability theory includes the concepts of a random variable, a sample space of events, and the probability of events in the sample space. The terms "consistent" and "certainty" don't have standard definitions.

The concept of "confidence" does have a standard definition but only in specific contexts, such as "confidence interval". The clearest way to write about problems in probability and statistics is not the most pleasing from a literary point of view. From a literary point of view, repeating the same phrase isn't good style. So instead of "0.95 probability of....", it sounds more pleasing to use a variety of expressions such as "95 percent confidence in... ", "95 percent certainty of ...", "95 percent frequency of..." etc. However, people who use a variety of phrases in place of "probability of..." end up writing statements that are unclear or mathematically plain wrong because they incorporate concepts such as "confidence in..." and "frequency of ..." that have technical definitions different from "probability of ...".

The general scenario for "confidence interval" is that we have a random variable ##X## and we have an algorithm that takes data from a sample of ##X## as input and outputs the endpoints of an interval ##[a,b]##. Let ##W## be a number - usually it will be a parameter associated with the distribution of ##X##, such as the mean value of the distribution.

One definition of "##[a,b]## is a 95 percent confidence interval for ##W##" is that there is a probability of 0.95 that ##W## is in ##[a,b]##. In this definition of "confidence interval", the interval ##[a,b]## is defined by two random variables ##a##, ##b##. These are random variables because the values of ##a## and ##b## are computed from the values in a random sample. The value ##W## is represented as a variable, not as a specific number. However, ##W## is not a random variable. We do not consider ##W## to have any associated probability distribution. This sort of "95 percent confidence interval" is not one specific interval with given numerical endpoints. Instead, such a confidence interval is given by a formula that contains random variables.

For example, let ##W## be the unknown mean of the distribution of ##X##. Let the algorithm produce the output: ##a = ## the mean of the sample minus twice the standard deviation of the sample, ##b =## the mean of the sample plus twice the standard deviation of the sample. The claim "##[a,b]## is a 95 percent confidence interval for ##W##" is a claim that there is 0.95 probability that the random interval ##[a,b]## contains the unknown value ##W##.

A different definition of "confidence interval" is used when an algorithm is applied to a specific sample and produces specific numerical outputs, such as ##[0.315 - 0.013, 0.315 + 0.013]##. Many people call such a specific interval a "confidence interval". This leads to the common misunderstanding that one may make the claim "There is a 0.95 probability that ##W## is in ##[0.315 - 0.013, 0.315 + 0.013]##" when this interval is the result of one application of the formula for a 95% confidence interval for ##W## to the data in one particular sample.

In the above scenario, it is a mathematical contradiction to begin by assuming that ##W## is not a random variable and then assert something about "the probability that W...." unless we are making a statement that involves random variables. We can sensibly talk about a probability for the event "##W## is in ##[a,b]##" since ##a## and ##b## are random variables and have associated probability distributions. However, there are no random variables involved in the statement "##W## is in ##[0.315-0.013, 0.315+0.013]##".

If you find it irresistible to think about ##W## as a quantity that can have a probability of being in ##[0.315-0.013, 0.315+0.013]## then you must use a Bayesian approach. In that approach, we do assume ##W## is a random variable and so it does have a probability distribution.
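
The "random interval" reading can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are Gaussian, the true mean is fixed at a made-up value, and the interval uses the common formula mean ± 1.96·s/√n (standard error of the mean), which is an assumption here and not the exact formula from the example above. The function name `coverage` and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def coverage(true_mean=0.315, sigma=0.1, n=50, trials=2000, seed=1):
    """Fraction of repeated samples whose interval [a, b] contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        centre = mean(sample)
        half_width = 1.96 * stdev(sample) / sqrt(n)
        a, b = centre - half_width, centre + half_width  # random endpoints
        if a <= true_mean <= b:
            hits += 1
    return hits / trials


frac = coverage()
print(f"fraction of intervals containing the true mean: {frac:.3f}")
```

The probability statement is about the procedure: the endpoints change from sample to sample while the true mean stays fixed. Any single computed interval either contains the true mean or it does not.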
 

Buzz Bloom

Gold Member
2,035
330
Neither can give a mathematically rigorous answer in the form of a probability unless some bold assumptions are made.
Hi FactChecker:

Your comments are quite helpful, except I am a bit confused by "bold". Can you elaborate about what makes the needed assumptions bold?

Regards,
Buzz
 

Buzz Bloom

Gold Member
2,035
330
Probably the simplest example is for estimating the probability of success for a series of Bernoulli trials (e.g. estimating how fair a coin is by flipping it a lot).
For the Bayesian approach it is pretty straightforward. If you have n trials with s successes and f=n−s failures then you describe the rate of success as a random variable p∼Beta(s+1,f+1).
Hi Dale:

Thank you for your example. What I am missing is what makes the method of a series of trials Bayesian. Perhaps I misunderstand, but my understanding of the Bayesian method involves conditional probabilities and priors. Your example had neither of these concepts.

I am also confused by the use of the Beta function:
Beta(s+1,f+1) = s! f! / (s+f+1)! .​
Properties section.​
I would have thought you want the reciprocal of the combinations distribution
1/Comb(s,f) = s! f! / (s+f)! .​
I get that for large n these two are approximately the same, but I do not understand the preference for the Beta form.

Regards,
Buzz
 

FactChecker

Science Advisor
Gold Member
2018 Award
4,877
1,674
Can you elaborate about what makes the needed assumptions bold?
Perhaps "bold" is too strong a word, but I was feeling bold when I wrote that. The Bayesian approach requires a prior distribution that comes either from prior data (which is assumed to be reasonably typical) or from reasonable subject knowledge. I don't think that there is a way around that. Some people start with a uniform prior distribution. I don't think that there is a good way to turn the frequentist approach into one where the results can be simply given as a probability. Either way, the results should be used carefully.
 
Dale

27,837
4,291
the Bayesian method involves conditional probabilities and priors. Your example had neither of these concepts
Sorry, I tried to avoid “jargon” and perhaps overdid it. I used an uninformative prior (a uniform distribution=Beta(1,1)). The plotted distribution is the posterior probability which is the conditional probability of the hypothesis given the data.

I am also confused by the use of the Beta function
It is not the beta function, it is the beta distribution: https://en.m.wikipedia.org/wiki/Beta_distribution

The Beta distribution is the conjugate prior for a Bernoulli trial.
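
To make the prior and the conditional probability explicit, here is a minimal sketch of the same update done numerically on a grid: posterior ∝ prior × likelihood, with a uniform prior. It then compares the result with the analytic Beta(s+1, f+1) density, whose normalizing constant is the beta function value s! f! / (s+f+1)!. The grid size and variable names are arbitrary choices for illustration.

```python
from math import factorial

s, f = 12, 18                                    # successes and failures
N = 2000                                         # number of grid cells on [0, 1]
grid = [(i + 0.5) / N for i in range(N)]         # cell midpoints

prior = [1.0 for _ in grid]                      # uniform prior = Beta(1, 1)
likelihood = [p**s * (1.0 - p) ** f for p in grid]   # P(data | p), up to a constant

# Bayes' theorem on the grid: multiply, then normalize so the density integrates to 1.
unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
norm = sum(unnormalized) / N
posterior = [u / norm for u in unnormalized]

# Analytic Beta(s+1, f+1) density for comparison.
beta_fn = factorial(s) * factorial(f) / factorial(s + f + 1)
analytic = [p**s * (1.0 - p) ** f / beta_fn for p in grid]

max_diff = max(abs(g - a) for g, a in zip(posterior, analytic))
post_mean = sum(p * d for p, d in zip(grid, posterior)) / N

print(f"largest difference from Beta(13, 19) density: {max_diff:.2e}")
print(f"posterior mean: {post_mean:.4f}")
```

Replacing the `prior` list with a non-uniform one changes the posterior, which is where the choice of prior enters the calculation.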
 

Buzz Bloom

Gold Member
2,035
330
The general scenario for "confidence interval" is that we have a random variable X and we have an algorithm that takes data from a sample of X as input and outputs the endpoints of an interval [a,b]. Let W be a number - usually it will be a parameter associated with the distribution of X, such as the mean value of the distribution.
Hi Stephen:

Thank you for your excellent explanation for the "confidence interval" concept.

With respect to the quote above I have a different understanding of the usage of "random variable" and "parameter". With respect to the Planck paper, the parameters are the physical variables:
h0, Ωr, Ωm, Ωk, and ΩΛ,​
for each of which a mean and error range and confidence interval are to be calculated from a collection of physical measurements. I understand that the Ωr parameter is an exception, and it is not calculated by the same "best fit" methods as the others. Also, these parameters are not independent, the sum of the four Ω parameters is defined to be 1.

These parameters are not random variables. They are unknowns to be estimated by some form of best fit methods. I am quite familiar with several frequentist methods for best fit calculations. What I know nothing about is the Bayesian method (involving conditional probabilities and priors) for making these best fit calculations. What I was hoping to learn (by means of a simple example) is how conditional probabilities and priors might be used to calculate a best fit.

Regards,
Buzz
 
Dale

27,837
4,291
These parameters are not random variables. They are unknowns to be estimated by some form of best fit methods.
In frequentist statistics they are not random variables. In Bayesian statistics they are random variables. With Bayesian methods you simply construct their joint posterior distribution, conditioned on the data. That is the equivalent of the frequentist fitting methods.
 

Stephen Tashi

Science Advisor
6,624
1,020
These parameters are not random variables. They are unknowns to be estimated by some form of best fit methods.
However the best fit methods are applied to data that is assumed to have random errors. So the measured data and anything computed from it is treated as a random variable. The physical parameters correspond to the unknown ##W## in the previous post. Sampling random variables produces the data about ##W##.
 

Buzz Bloom

Gold Member
2,035
330
Hi @Stephen Tashi, @Orodruin, @FactChecker, and @Dale:

I want to thank you again for all your help in educating me about my question in the OP. One thing in particular I learned is that Bayesian statistics is a great deal more complicated than I thought it would be. My original understanding was limited to just Bayes' theorem.

I still feel I have failed to learn what my original question was seeking. I would like to try one more example, this one involving a best fit problem from 1929.

OrigHubbleGalaxies.png

The figure has 24 dots, each representing a distance D and a velocity V corresponding to a galaxy. The straight line represents a simple form of the Hubble equation:
V = h0 D .​
Presumably h0 is calculated by a suitable best fit method minimizing the sum S of the squares of the 24 error velocities:
##S = \sum_{i=1}^{24} (V_i - h_0 D_i)^2## .
One method of doing this is to calculate the value h0 which satisfies the equation
dS/dh0 = 0 .​
That would give a value for h0 as a mean of a distribution of possible h0 values. However, additional work is needed to get a value for a standard deviation. I have a method in mind for how this might be done, but it is somewhat complicated, and I think I should omit it from the discussion unless someone would particularly like to see it. I interpret the method I have in mind for calculating the standard deviation as frequentist. The method I described above as producing a mean I do not think of as involving any statistics.
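
For reference, the minimization has a closed-form solution. Reading the sum as ##S(h_0)=\sum_i (V_i - h_0 D_i)^2##, with ##D_i## the distance of galaxy ##i## (an interpretation of the formula above), setting the derivative to zero gives:

$$\frac{dS}{dh_0} = -2\sum_{i=1}^{24} D_i\,(V_i - h_0 D_i) = 0 \quad\Longrightarrow\quad \hat h_0 = \frac{\sum_{i=1}^{24} V_i D_i}{\sum_{i=1}^{24} D_i^2}$$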

How would a calculation be made using Bayesian methods leading to a value for a mean and standard deviation for a distribution of possible h0 values? What would the conditional probabilities and priors be?
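
One possible way to set this up, as a sketch only. Assume each velocity has an independent Gaussian error with a known scatter, and take a flat prior on h0. The conditional probability is then the likelihood P(velocities | h0), and the posterior for h0 is proportional to prior × likelihood. The data below are synthetic, not Hubble's measurements, and `TRUE_H0`, `SIGMA_V`, and the grid range are hypothetical values chosen only for illustration.

```python
import random
from math import exp, sqrt

rng = random.Random(0)
TRUE_H0 = 500.0   # hypothetical slope, used only to generate synthetic data
SIGMA_V = 200.0   # assumed known scatter of each velocity

# Synthetic stand-in for the 24 (distance, velocity) points.
D = [0.1 + 1.9 * rng.random() for _ in range(24)]
V = [TRUE_H0 * d + rng.gauss(0.0, SIGMA_V) for d in D]

# Least-squares slope through the origin, for comparison.
sum_dd = sum(d * d for d in D)
h0_ls = sum(v * d for v, d in zip(V, D)) / sum_dd


def log_likelihood(h0):
    """log P(V | h0) up to a constant, for independent Gaussian errors."""
    return -0.5 * sum((v - h0 * d) ** 2 for v, d in zip(V, D)) / SIGMA_V**2


# Flat prior: the posterior is proportional to the likelihood on the grid.
grid = [h0_ls - 400.0 + 0.5 * i for i in range(1601)]
log_post = [log_likelihood(h) for h in grid]
peak = max(log_post)
weights = [exp(lp - peak) for lp in log_post]   # subtract peak to avoid underflow
total = sum(weights)

post_mean = sum(h * w for h, w in zip(grid, weights)) / total
post_sd = sqrt(sum((h - post_mean) ** 2 * w for h, w in zip(grid, weights)) / total)
analytic_sd = SIGMA_V / sqrt(sum_dd)            # closed form for this Gaussian model

print(f"least squares h0 = {h0_ls:.1f}")
print(f"posterior mean = {post_mean:.1f}, posterior sd = {post_sd:.1f}")
print(f"analytic sd = {analytic_sd:.1f}")
```

Under these particular assumptions the posterior mean coincides with the least-squares value. A non-flat prior, or an unknown scatter treated as a second parameter, would change the result.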

Regards,
Buzz
 

Stephen Tashi

Science Advisor
6,624
1,020
One method of doing this is to calculate the value h0 which satisfies the equation
dS/dh0 = 0 .​
That would give a value for h0 as a mean of a distribution of possible h0 values.
Why does it give a mean value of ##h_0##? Applying the least squares fitting procedure to a set of data doesn't produce other "possible ##h_0## values". So there isn't any distribution of ##h_0## values.

A difficulty in converting this example to a Bayesian confidence interval example is that, as an example of frequentist reasoning, it is complicated. If you think about it, the fitting procedure will produce one single value for ##h_0##. So how is it possible for ##h_0## to have any standard deviation? How do we attribute a standard deviation to a single number? Computer software that outputs standard deviations for parameters obtained from fits must use various assumptions and approximations to estimate a standard deviation for ##h_0##.

The conceptually simplest method is to use estimated statistical properties of the data to generate simulated data and use the varying values of ##h_0## obtained by fits to the simulated data to estimate the variability of ##h_0##. I would call this a Bayesian approach because assumptions about the statistical properties of the data implicitly determine a statistical distribution for ##h_0##.

A different method is to make the assumptions that allow the use of a "linear asymptotic" estimator for ##h_0##. I don't know if pure frequentist reasoning can be used to justify that technique.
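
The simulated-data idea can be sketched as follows. This kind of procedure is often called a parametric bootstrap. The sketch assumes Gaussian velocity errors with a scatter estimated from the fit residuals. The data are synthetic and every numerical value is hypothetical.

```python
import random
from math import sqrt
from statistics import stdev


def fit_h0(D, V):
    """Least-squares slope of V = h0 * D through the origin."""
    return sum(v * d for v, d in zip(V, D)) / sum(d * d for d in D)


rng = random.Random(0)

# Synthetic stand-in for the observed data (not Hubble's measurements).
D = [0.1 + 1.9 * rng.random() for _ in range(24)]
V = [500.0 * d + rng.gauss(0.0, 200.0) for d in D]

# Step 1: fit the real data and estimate the scatter from the residuals.
h0_hat = fit_h0(D, V)
residuals = [v - h0_hat * d for v, d in zip(V, D)]
sigma_hat = sqrt(sum(r * r for r in residuals) / (len(D) - 1))  # one fitted parameter

# Step 2: generate simulated data sets from the fitted model and refit each one.
sim_h0 = []
for _ in range(2000):
    V_sim = [h0_hat * d + rng.gauss(0.0, sigma_hat) for d in D]
    sim_h0.append(fit_h0(D, V_sim))

# Step 3: the spread of the refitted slopes estimates the variability of h0.
h0_sd = stdev(sim_h0)
print(f"h0 = {h0_hat:.1f} +/- {h0_sd:.1f}")
```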
 

Orodruin

Staff Emeritus
Science Advisor
Homework Helper
Insights Author
Gold Member
2018 Award
15,633
5,658
A difficulty in converting this example to a Bayesian confidence interval example is that, as an example of frequentist reasoning, it is complicated.
Another is that Bayesian statistics produce credible intervals, not confidence intervals, which are inherently frequentist.
 

FactChecker

Science Advisor
Gold Member
2018 Award
4,877
1,674
Another is that Bayesian statistics produce credible intervals, not confidence intervals, which are inherently frequentist.
At least as credible as the prior distribution, as modified by some amount of data, can be. A wrong prior can require a great amount of data to correct.
 

Orodruin

Staff Emeritus
Science Advisor
Homework Helper
Insights Author
Gold Member
2018 Award
15,633
5,658
At least as credible as the prior distribution, as modified by some amount of data, can be. A wrong prior can require a great amount of data to correct.
Don’t blame me, I did not invent the nomenclature ...
 

Buzz Bloom

Gold Member
2,035
330
How do we attribute a standard deviation to a single number?
The conceptually simplest method is to use estimated statistical properties of the data to generate simulated data and use the varying values of h0 obtained by fits to the simulated data to estimate the variability of h0.
Hi Stephen:

The method I have in mind shares some of the ideas you suggest for generating simulated data. Specifically, the method repeats the following a relatively large number of times.
1. Randomly select a fraction of the data points, say for example, one half.
2. Find the value of h0 which corresponds to the least mean square fit.
Each of these h0 values is a random variable, and the collection can be used to calculate a mean and standard deviation.

I would expect this mean to be close to the previously calculated value using all of the data. I may be mistaken, but I think it likely that the original value using all the data is a better estimate of the best value to use than the second value. It is in this sense that I used the term "mean" before the distribution was generated. One question I am not certain about is the best number of iterations/trials to use for the calculation of the standard deviation. My guess is that the same number as the original set of data points is a good choice. I would also add the square of the difference between the two mean values to the square of the calculated standard deviation.
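
The two steps above can be sketched as follows, using synthetic data in place of the 24 galaxies. The number of repetitions (1000) and the fraction (one half) are arbitrary choices for illustration, and all numerical values are hypothetical.

```python
import random
from statistics import mean, stdev


def fit_h0(D, V):
    """Least-squares slope of V = h0 * D through the origin."""
    return sum(v * d for v, d in zip(V, D)) / sum(d * d for d in D)


rng = random.Random(0)

# Synthetic stand-in for the observed data (not Hubble's measurements).
D = [0.1 + 1.9 * rng.random() for _ in range(24)]
V = [500.0 * d + rng.gauss(0.0, 200.0) for d in D]

h0_all = fit_h0(D, V)            # fit using all of the data

sub_h0 = []
for _ in range(1000):
    # Step 1: randomly select half of the data points, without replacement.
    idx = rng.sample(range(len(D)), len(D) // 2)
    # Step 2: least-squares fit to the selected points only.
    sub_h0.append(fit_h0([D[i] for i in idx], [V[i] for i in idx]))

sub_mean = mean(sub_h0)
sub_sd = stdev(sub_h0)
print(f"all data: h0 = {h0_all:.1f}")
print(f"half samples: mean = {sub_mean:.1f}, sd = {sub_sd:.1f}")
```

How the spread of the half-sample fits relates to the uncertainty of the full-data fit depends on the fraction selected, so the resulting standard deviation should be read with that in mind.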

I would welcome your opinion.

Regards,
Buzz
 
