Questions about error range from Bayesian statistics

Buzz Bloom · Jun 13, 2019

About cosmology:

https://www.physicsforums.com/threa...verse-is-absolutely-flat.971984/#post-6180036

Orodruin said:

There are two main types of statistics used in physics: frequentist and Bayesian. The one most commonly used in cosmology is Bayesian statistics, which Planck also states that they are using at the beginning of page 4.

Planck paper:

https://arxiv.org/pdf/1502.01589.pdf

As in PCP13 we adopt a Bayesian framework for testing theoretical models.

In the Planck paper, pages 38-40, equations 45 through 52, there are values given for parameters with error ranges and confidence levels. Here is an example.

Ω_K=0.000±0.005 (95%,PlanckTT+lowP+lensing+BAO). (50)

About QM:

https://www.physicsforums.com/threads/the-philosophy-of-quantum-mechanics.972634/

Post #17

These are not only philosophical points of view, but from these interpretations come different methods of inferences.

For example, the debate that took place during the discovery of the Higgs Bosons is about the difference between frequentist and Bayesian inferences: http://tonyohagan.co.uk/academic/pdf/HiggsBoson.pdf

I would much appreciate it if someone would post a simple example of how Bayesian statistics calculates an error range and confidence level. Before I retired I had some occasions to use frequentist statistics to create models for data mining, so I am familiar with how to calculate an error range and confidence level using frequentist statistics, but I have no experience or familiarity with using Bayesian statistics for this purpose.

Regards,
Buzz

Stephen Tashi · Jun 13, 2019

Buzz Bloom said:

I would much appreciate it if someone would post a simple example of how Bayesian statistics calculates an error range and confidence level.

Before you consider examples, you should consider definitions and completely understand what "confidence interval" and "error range" mean in frequentist statistics. They have a different meaning in Bayesian statistics. The fact you know how to calculate confidence intervals in frequentist statistics doesn't necessarily prepare you for understanding the definition of what you are calculating.

Buzz Bloom · Jun 13, 2019

Stephen Tashi said:

Before you consider examples, you should consider definitions and completely understand what "confidence interval" and "error range" mean in frequentist statistics. They have a different meaning in Bayesian statistics. The fact you know how to calculate confidence intervals in frequentist statistics doesn't necessarily prepare you for understanding the definition of what you are calculating.

Hi Stephen:

I thank you for educating me that the meaning of the terms is different in Bayesian statistics than in frequentist statistics. I believe I do understand the meanings of "confidence interval" and "error range" in frequentist statistics. Take the example I gave quoting Equation 50, even though that example related to cosmology. The "95%" confidence level corresponds to approximately a 2σ value assuming a Gaussian distribution. The area under the standard normal curve between 0 and 2σ is

Φ(2) - 0.5 = 0.47725.

That means

2×0.47725 = 0.9545 ~= 95%.

This means that the +/-0.005 corresponds to

σ ~= 0.0025.

This seems to me to be the same meaning it has in frequentist statistics. However, I could be mistaken.
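
The arithmetic above can be checked numerically. This is a minimal Python sketch that assumes a Gaussian distribution; the helper name `central_probability` is made up for illustration and is not from the Planck paper.

```python
import math
from statistics import NormalDist

def central_probability(k: float) -> float:
    """Probability that a Gaussian variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

two_sigma = central_probability(2.0)   # total area between -2 sigma and +2 sigma
half_area = two_sigma / 2.0            # area between 0 and +2 sigma

# Reading +/-0.005 as a 2-sigma half-width gives sigma = 0.0025.
sigma_from_2 = 0.005 / 2.0

# Using the exact two-sided 95% multiplier (about 1.96) gives a slightly larger sigma.
z95 = NormalDist().inv_cdf(0.975)
sigma_from_95 = 0.005 / z95

print(f"P(|x| < 2 sigma)   = {two_sigma:.4f}")
print(f"area from 0 to 2   = {half_area:.4f}")
print(f"sigma (2-sigma)    = {sigma_from_2:.5f}")
print(f"sigma (exact 95%)  = {sigma_from_95:.5f}")
```

Whether the Planck number should be read this way depends on the posterior actually being close to Gaussian, which the sketch simply assumes.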

The articles

https://en.wikipedia.org/wiki/Frequentist_inference

and

https://en.wikipedia.org/wiki/Confidence_interval

(if I correctly understood what I read) are consistent with my example description above. The article

https://en.wikipedia.org/wiki/Bayesian_statistics#Bayesian_inference

did not use the words "confidence" or "error", so this leads me to assume that the concepts of "confidence interval" and "error range" are NOT part of the Bayesian methods. However, the use of Bayesian methods in the Planck article produced values corresponding to these concepts.

The question for which I am seeking an example is:

How can the values of these two concepts get calculated if the method used is Bayesian?

If the meaning of these concepts is different in Bayesian and frequentist methods, I do not know where to look for an explanation of that. Can you help me with that?

Regards,
Buzz

Orodruin · Jun 13, 2019

Cosmologists frequently abuse nomenclature using ”confidence region” or ”confidence level” when they are actually referring to ”credible region” or ”integrated probability”. If you are not familiar with the particular use in the field it is easy to be led astray.

Stephen Tashi · Jun 13, 2019

Buzz Bloom said:

This seems to me to be the same meaning it has in frequentist statistics.

You've given an example of calculating some numbers without stating any interpretation for what those numbers represent. What does a "95% confidence interval" mean to you?

Buzz Bloom · Jun 13, 2019

Orodruin said:

Cosmologists frequently abuse nomenclature using ”confidence region” or ”confidence level” when they are actually referring to ”credible region” or ”integrated probability”. If you are not familiar with the particular use in the field it is easy to be led astray.

Hi Orodruin:

"Integrated probability" seems similar to "Gaussian error function", except that the distribution, as I understand it, is not Gaussian, and may not even be symmetric. Aside from that, can you please explain in what way the concept is different than the one I described in my post #3?

Regards,
Buzz

Buzz Bloom · Jun 13, 2019

Stephen Tashi said:

You've given an example of calculating some numbers without stating any interpretation for what those numbers represent. What does a "95% confidence interval" mean to you?

Hi Stephen:

I am familiar with two ways to express this. Both ways are OK as I understand the concepts. I am using the same example of a mean value of zero for Ω_k with a +/- range of 0.005 as described in my post #3.
1. The data from which the probabilities are calculated are consistent with a 95% certainty that the parameter value of interest, Ω_k, satisfies

-0.005 < Ω_k < +0.005 .

2. I have reason to believe that

-0.005 < Ω_k < +0.005 .

is a true statement with a 95% confidence that this belief is correct.

If you prefer a different way to express the concept, please post it.

Regards,
Buzz

FactChecker · Jun 13, 2019

Buzz Bloom said:

1. The data from which the probabilities are calculated are consistent with a 95% certainty that the parameter value of interest, Ω_k, satisfies

-0.005 < Ω_k < +0.005 .

2. I have reason to believe that

-0.005 < Ω_k < +0.005 .
is a true statement with a 95% confidence that this belief is correct.

The phrases "95% certainty" and "95% confidence" are intentionally vague, because any probabilities depend on parameter assumptions that are undeniably incorrect (the assumed parameter values are almost certainly inexact and may be very wrong). It is left up to the user to apply the results judiciously.

Buzz Bloom · Jun 14, 2019

FactChecker said:

The phrases "95% certainty" and "95% confidence" are intentionally vague, because any probabilities depend on parameter assumptions that are undeniably incorrect (the assumed parameter values are almost certainly inexact and may be very wrong). It is left up to the user to apply the results judiciously.

Hi FactChecker:

Thank you for your explanation. I should have taken the concept of conditional probability into account.
Let me try again.

The two renditions of the concept of "confidence interval" in my post #7 are based on frequentist statistics concepts. For Bayesian statistics I need to include the concept of conditional probability.

1. If I assume that certain conditional probability priors are reasonable choices, then the data from which the probabilities are calculated are consistent with a 95% certainty that the parameter value of interest, Ω_k, satisfies

-0.005 < Ω_k < +0.005 .

2. If I assume that certain conditional probability priors are reasonable choices, then I have reason to believe that

-0.005 < Ω_k < +0.005 .

is a true statement with a 95% confidence that this belief is correct.

Is that better? If so, I would really appreciate seeing a simple example of how Bayesian statistics can be used to calculate values of mean and standard deviation for a single parameter based on a collection of conditional probabilities and priors.

Regards,
Buzz

FactChecker · Jun 14, 2019

As long as it is clear that both approaches require a realistic understanding of the assumptions involved and are used judiciously. Neither can give a mathematically rigorous answer in the form of a probability unless some bold assumptions are made.

Dale · Jun 14, 2019

Buzz Bloom said:

I would much appreciate it if someone would post a simple example of how Bayesian statistics calculates an error range and confidence level.

Probably the simplest example is for estimating the probability of success for a series of Bernoulli trials (e.g. estimating how fair a coin is by flipping it a lot).

For the Bayesian approach it is pretty straightforward. If you have ##n## trials with ##s## successes and ##f=n-s## failures then you describe the rate of success as a random variable ##p\sim Beta(s+1,f+1)##. Since you have a probability distribution for the success rate ##p## then you can do everything that you can do with probability distributions, including finding the mean, estimating quantiles, and calculating interval probabilities.

For example if you do 30 trials and get 12 successes and 18 failures, then you would have ##p\sim Beta(13,19)## which is plotted below. The mean of this distribution is 0.41 and the 0.025 quantile is 0.25 and the 0.975 quantile is 0.58. This means that there is a 0.95 probability that the success rate is between 0.25 and 0.58.
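
The numbers in this example can be reproduced with the Python standard library alone. The sketch below is only illustrative: it relies on the identity that, for integer parameters, the Beta(a, b) cumulative distribution equals a binomial tail probability, and it inverts that function by bisection. The function names are made up for this example.

```python
import math

def beta_cdf_int(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1.0 - x)**(n - j) for j in range(a, n + 1))

def beta_quantile_int(q: float, a: int, b: int, tol: float = 1e-10) -> float:
    """Invert the CDF by bisection on the interval [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s, f = 12, 18            # successes and failures from the example
a, b = s + 1, f + 1      # parameters of the Beta(13, 19) distribution
post_mean = a / (a + b)  # 13/32
q_low = beta_quantile_int(0.025, a, b)
q_high = beta_quantile_int(0.975, a, b)

print(f"mean           = {post_mean:.3f}")
print(f"0.025 quantile = {q_low:.3f}")
print(f"0.975 quantile = {q_high:.3f}")
```

The printed values can be compared with the rounded figures quoted above; the interval between the two quantiles carries 0.95 of the probability by construction.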

For the frequentist approach it is less straightforward. You can find an estimator ##\hat p = s/n## of the unknown true probability ##p## which is not treated as a random variable. Because ##p## is not a random variable you cannot take its mean, but you can assume values for ##p## and use the Binomial distribution to predict how likely you are to see ##s## successes in ##n## trials and therefore you can get a likelihood function for ##\hat p##. When you do that then you can estimate the probability of getting a ##\hat p## value that is more extreme than the observed value under any given hypothesis, such as the hypothesis that ##p=0.5##. A 95% confidence interval is then the output of a procedure which, if you repeated the exact experiment many, many times, would produce intervals that contain the true ##p## in 95% of the repetitions.

For the above example, ##\hat p = 0.40## and the 95% confidence interval is ##[0.22,0.58]##, which is pretty similar to the Bayesian estimate.
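
The post does not say which formula produced ##[0.22,0.58]##. As an assumption, the sketch below uses the normal-approximation (Wald) interval, which happens to reproduce those two decimals; other frequentist intervals such as Clopper-Pearson or Wilson give slightly different endpoints.

```python
import math
from statistics import NormalDist

n, s = 30, 12
p_hat = s / n                                # point estimate, 0.40

z = NormalDist().inv_cdf(0.975)              # two-sided 95% multiplier, about 1.96
std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)

wald_low = p_hat - z * std_err
wald_high = p_hat + z * std_err

print(f"p_hat = {p_hat:.2f}")
print(f"95% Wald interval = [{wald_low:.2f}, {wald_high:.2f}]")
```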

Stephen Tashi · Jun 14, 2019

Buzz Bloom said:

1. The data from which the probabilities are calculated are consistent with a 95% certainty that the parameter value of interest, Ω_k, satisfies

-0.005 < Ω_k < +0.005 .
2. I have reason to believe that

-0.005 < Ω_k < +0.005 .
is a true statement with a 95% confidence that this belief is correct.

The language of mathematical probability theory includes the concepts of a random variable, a sample space of events, and the probability of events in the sample space. The terms "consistent" and "certainty" don't have standard definitions.

The concept of "confidence" does have a standard definition but only in specific contexts, such as "confidence interval". The clearest way to write about problems in probability and statistics is not the most pleasing from a literary point of view. From a literary point of view, repeating the same phrase isn't good style. So instead of "0.95 probability of...", it sounds more pleasing to use a variety of expressions such as "95 percent confidence in... ", "95 percent certainty of ...", "95 percent frequency of..." etc. However, people who use a variety of phrases in place of "probability of..." end up writing statements that are unclear or mathematically plain wrong because they incorporate concepts such as "confidence in..." and "frequency of ..." that have technical definitions different from "probability of ...".

The general scenario for "confidence interval" is that we have a random variable ##X## and we have an algorithm that takes data from a sample of ##X## as input and outputs the endpoints of an interval ##[a,b]##. Let ##W## be a number - usually it will be a parameter associated with the distribution of ##X##, such as the mean value of the distribution.

One definition of "##[a,b]## is a 95 percent confidence interval for ##W##" is that there is a probability of 0.95 that ##W## is in ##[a,b]##. In this definition of "confidence interval", the interval ##[a,b]## is defined by two random variables ##a##, ##b##. These are random variables because the values of ##a## and ##b## are computed from the values in a random sample. The value ##W## is represented as a variable, not as a specific number. However ##W## is not a random variable. We do not consider ##W## to have any associated probability distribution. This sort of "95 percent confidence interval" is not one specific interval with given numerical endpoints. Instead, such a confidence interval is given by a formula that contains random variables.

For example, let ##W## be the unknown mean of the distribution of ##X##. Let the algorithm produce the output: ##a = ## the mean of the sample minus twice the standard deviation of the sample, ##b =## the mean of the sample plus twice the standard deviation of the sample. The claim "##[a,b]## is a 95 percent confidence interval for ##W##" is a claim that there is 0.95 probability that the random interval ##[a,b]## contains the unknown value ##W##.
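
The idea that the probability belongs to the random endpoints, not to ##W##, can be illustrated with a small simulation. This is only a sketch under stated assumptions: ##X## is taken to be Gaussian, the true mean and spread are made-up numbers, and the half-width is twice the standard error of the mean (sample standard deviation divided by ##\sqrt{n}##) so that the coverage comes out near 0.95.

```python
import math
import random
import statistics

def interval_from_sample(sample):
    """Compute the random endpoints (a, b) from one sample."""
    m = statistics.mean(sample)
    # Assumed choice: twice the standard error of the mean as half-width.
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - 2.0 * se, m + 2.0 * se

def coverage(true_w, sigma, n, trials, seed=0):
    """Fraction of repeated experiments whose interval contains the fixed value true_w."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_w, sigma) for _ in range(n)]
        a, b = interval_from_sample(sample)
        if a <= true_w <= b:
            hits += 1
    return hits / trials

# Hypothetical values: W is fixed, only the samples (and hence a and b) vary.
cov = coverage(true_w=10.0, sigma=1.0, n=30, trials=5000)
print(f"fraction of intervals containing W: {cov:.3f}")
```

Each individual interval either contains 10.0 or it does not; the fraction close to 0.95 describes the procedure over many repetitions.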

A different definition of "confidence interval" is used when an algorithm is applied to a specific sample and produces specific numerical outputs, such as ##[0.315 - 0.013, 0.315 + 0.013]##. Many people call such a specific interval a "confidence interval". This leads to the common misunderstanding that one may make the claim "There is a 0.95 probability that ##W## is in ##[0.315 - 0.013, 0.315 + 0.013]##" when this interval is the result of one application of the formula for a 95% confidence interval for ##W## to the data in one particular sample.

In the above scenario, it is a mathematical contradiction to begin by assuming that ##W## is not a random variable and then assert something about "the probability that W..." unless we are making a statement that involves random variables. We can sensibly talk about a probability for the event "##W## is in ##[a,b]##" since ##a## and ##b## are random variables and have associated probability distributions. However, there are no random variables involved in the statement "##W## is in ##[0.315-0.013, 0.315+0.013]##".

If you find it irresistible to think about ##W## as a quantity that can have a probability of being in ##[0.315-0.013, 0.315+0.013]## then you must use a Bayesian approach. In that approach, we do assume ##W## is a random variable and so it does have a probability distribution.

Buzz Bloom · Jun 16, 2019

FactChecker said:

Neither can give a mathematically rigorous answer in the form of a probability unless some bold assumptions are made.

Hi FactChecker:

Your comments are quite helpful, except I am a bit confused by "bold". Can you elaborate about what makes the needed assumptions bold?

Regards,
Buzz

Buzz Bloom · Jun 16, 2019

Dale said:

Probably the simplest example is for estimating the probability of success for a series of Bernoulli trials (e.g. estimating how fair a coin is by flipping it a lot).
For the Bayesian approach it is pretty straightforward. If you have n trials with s successes and f=n−s failures then you describe the rate of success as a random variable p∼Beta(s+1,f+1).

Hi Dale:

Thank you for your example. What I am missing is what makes the method of a series of trials Bayesian. Perhaps I misunderstand, but my understanding of the Bayesian method involves conditional probabilities and priors. Your example had neither of these concepts.

I am also confused by the use of the Beta function:

Beta(s+1,f+1) = s! f! / (s+f+1)! .

https://en.wikipedia.org/wiki/Beta_function

Properties section.

I would have thought you want the reciprocal of the combinations distribution

1/Comb(s,f) = s! f! / (s+f)! .

I get that for large n these two are approximately the same, but I do not understand the preference for the Beta form.

Regards,
Buzz

FactChecker · Jun 16, 2019

Buzz Bloom said:

Can you elaborate about what makes the needed assumptions bold?

Perhaps "bold" is too strong a word, but I was feeling bold when I wrote that. The Bayesian approach requires a prior distribution that comes either from prior data (which is assumed to be reasonably typical) or from reasonable subject knowledge. I don't think that there is a way around that. Some people start with a uniform prior distribution. I don't think that there is a good way to turn the frequentist approach into one where the results can be simply given as a probability. Either way, the results should be used carefully.

Dale · Jun 16, 2019

Buzz Bloom said:

the Bayesian method involves conditional probabilities and priors. Your example had neither of these concepts

Sorry, I tried to avoid “jargon” and perhaps overdid it. I used an uninformative prior (a uniform distribution=Beta(1,1)). The plotted distribution is the posterior probability which is the conditional probability of the hypothesis given the data.

Buzz Bloom said:

I am also confused by the use of the Beta function

It is not the beta function, it is the beta distribution: https://en.m.wikipedia.org/wiki/Beta_distribution

The Beta distribution is the conjugate prior for a Bernoulli trial.
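
To make the distinction concrete, here is a rough sketch in Python. It shows the beta function acting only as the normalising constant of the Beta distribution, and it checks on a grid that Bayes' theorem with a uniform Beta(1,1) prior and the 12-successes, 18-failures data gives the Beta(13,19) posterior. The function names and the grid size are choices made for this illustration.

```python
import math

def beta_function(a: int, b: int) -> float:
    """Beta function B(a, b) = (a-1)! (b-1)! / (a+b-1)! for positive integers."""
    return math.factorial(a - 1) * math.factorial(b - 1) / math.factorial(a + b - 1)

def beta_pdf(p: float, a: int, b: int) -> float:
    """Density of the Beta(a, b) distribution; B(a, b) only normalises it."""
    return p**(a - 1) * (1.0 - p)**(b - 1) / beta_function(a, b)

s, f = 12, 18
prior_a, prior_b = 1, 1                        # uniform prior = Beta(1, 1)
post_a, post_b = prior_a + s, prior_b + f      # conjugate update gives Beta(13, 19)

# Bayes' theorem on a grid: posterior is proportional to likelihood times prior.
N = 2000
grid = [(i + 0.5) / N for i in range(N)]
unnorm = [p**s * (1.0 - p)**f * beta_pdf(p, prior_a, prior_b) for p in grid]
norm = sum(unnorm) / N                         # midpoint-rule integral over [0, 1]

max_diff = max(abs(u / norm - beta_pdf(p, post_a, post_b))
               for p, u in zip(grid, unnorm))

print(f"posterior parameters: Beta({post_a}, {post_b})")
print(f"largest difference between grid posterior and Beta pdf: {max_diff:.2e}")
```

The binomial coefficient is left out of the likelihood because it does not depend on ##p## and cancels in the normalisation.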

Buzz Bloom · Jun 16, 2019

Stephen Tashi said:

The general scenario for "confidence interval" is that we have a random variable X and we have an algorithm that takes data from a sample of X as input and outputs the endpoints of an interval [a,b]. Let W be a number - usually it will be a parameter associated with the distribution of X, such as the mean value of the distribution.

Hi Stephen:

Thank you for your excellent explanation for the "confidence interval" concept.

With respect to the quote above I have a different understanding of the usage of "random variable" and "parameter". With respect to the Planck paper, the parameters are the physical variables:

h₀, Ω_r, Ω_m, Ω_k, and Ω_Λ,

for each of which a mean and error range and confidence interval are to be calculated from a collection of physical measurements. I understand that the Ω_r parameter is an exception, and it is not calculated by the same "best fit" methods as the others. Also, these parameters are not independent: the sum of the four Ω parameters is defined to be 1.

These parameters are not random variables. They are unknowns to be estimated by some form of best fit methods. I am quite familiar with several frequentist methods for best fit calculations. What I know nothing about is the Bayesian method (involving conditional probabilities and priors) for making these best fit calculations. What I was hoping to learn (by means of a simple example) is how conditional probabilities and priors might be used to calculate a best fit.

Regards,
Buzz

Dale · Jun 16, 2019

Buzz Bloom said:

These parameters are not random variables. They are unknowns to be estimated by some form of best fit methods.

In frequentist statistics they are not random variables. In Bayesian statistics they are random variables. With Bayesian methods you simply construct their joint posterior distribution, conditioned on the data. That is the equivalent of the frequentist fitting methods.

Stephen Tashi · Jun 16, 2019

Buzz Bloom said:

These parameters are not random variables. They are unknowns to be estimated by some form of best fit methods.

However the best fit methods are applied to data that is assumed to have random errors. So the measured data and anything computed from it is treated as a random variable. The physical parameters correspond to the unknown ##W## in the previous post. Sampling random variables produces the data about ##W##.

Buzz Bloom · Jun 16, 2019

Hi @Stephen Tashi, @Orodruin, @FactChecker, and @Dale:

I want to thank you again for all your help in educating me about my question in the OP. One thing in particular I learned is that Bayesian statistics is a great deal more complicated than I thought it would be. My original understanding was limited to just Bayes' theorem.

I still feel I have failed to learn what my original question was seeking. I would like to try one more example, this one involving a best fit problem from 1929.

http://www.astro.caltech.edu/~george/ay127/readings/FreedmanMadore2010.pdf

The figure has 24 dots, each representing a distance D and a velocity V corresponding to a galaxy. The straight line represents a simple form of the Hubble equation:

V = h₀ D .

Presumably h₀ is calculated by a suitable best fit method minimizing the sum S of the squares of the 24 error velocities:

S = ∑_{[i=1 to 24]} (V_i - h₀D_i)² .

One method of doing this is to calculate the value h₀ which satisfies the equation

dS/dh₀ = 0 .

That would give a value for h₀ as a mean of a distribution of possible h₀ values. However, additional work is needed to get a value for a standard deviation. I have a method in mind for how this might be done, but it is somewhat complicated, and I think I should omit it from the discussion unless someone would particularly like to see it. I interpret the method I have in mind for calculating the standard deviation as frequentist. The method I described above as producing a mean I do not think of as involving any statistics.
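
For the one-parameter line through the origin, setting dS/dh₀ = 0 gives the closed form h₀ = ∑ D_i V_i / ∑ D_i². A minimal Python sketch of that step follows; the distances and velocities are made-up illustrative numbers, not the 24 points of the 1929 figure.

```python
def fit_slope_through_origin(distances, velocities):
    """Least-squares slope for V = h0 * D, from dS/dh0 = 0."""
    sxy = sum(d * v for d, v in zip(distances, velocities))
    sxx = sum(d * d for d in distances)
    return sxy / sxx

def sum_of_squares(h0, distances, velocities):
    """S(h0) = sum over i of (V_i - h0 * D_i)^2."""
    return sum((v - h0 * d) ** 2 for d, v in zip(distances, velocities))

# Hypothetical data (distance in Mpc, velocity in km/s), for illustration only.
D = [0.3, 0.5, 0.9, 1.1, 1.4, 1.7, 2.0]
V = [170.0, 290.0, 650.0, 450.0, 500.0, 960.0, 850.0]

h0_hat = fit_slope_through_origin(D, V)
print(f"h0 estimate = {h0_hat:.1f} km/s/Mpc")
print(f"S at estimate = {sum_of_squares(h0_hat, D, V):.0f}")
```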

How would a calculation be made using Bayesian methods leading to a value for a mean and standard deviation for a distribution of possible h₀ values? What would the conditional probabilities and priors be?

Regards,
Buzz

Stephen Tashi · Jun 16, 2019

Buzz Bloom said:

One method of doing this is to calculate the value h₀ which satisfies the equation

dS/dh₀ = 0 .
That would give a value for h₀ as a mean of a distribution of possible h₀ values.

Why does it give a mean value of ##h_0##? Applying the least squares fitting procedure to a set of data doesn't produce other "possible ##h_0## values". So there isn't any distribution of ##h_0## values.

A difficulty in converting this example to a Bayesian confidence interval example is that, as an example of frequentist reasoning, it is complicated. If you think about it, the fitting procedure will produce one single value for ##h_0##. So how is it possible for ##h_0## to have any standard deviation? How do we attribute a standard deviation to a single number? Computer software that outputs standard deviations for parameters obtained from fits must use various assumptions and approximations to estimate a standard deviation for ##h_0##.

The conceptually simplest method is to use estimated statistical properties of the data to generate simulated data and use the varying values of ##h_0## obtained by fits to the simulated data to estimate the variability of ##h_0##. I would call this a Bayesian approach because assumptions about the statistical properties of the data implicitly determine a statistical distribution for ##h_0##.
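
A rough sketch of the simulated-data idea, in Python. The assumptions are mine and are stated loudly: the data values are hypothetical, the errors are taken to be Gaussian in the velocities only, with one common standard deviation estimated from the residuals of the fit.

```python
import math
import random
import statistics

def fit_slope(distances, velocities):
    """Least-squares slope for a line through the origin."""
    return (sum(d * v for d, v in zip(distances, velocities))
            / sum(d * d for d in distances))

# Hypothetical data (Mpc, km/s), for illustration only.
D = [0.3, 0.5, 0.9, 1.1, 1.4, 1.7, 2.0]
V = [170.0, 290.0, 650.0, 450.0, 500.0, 960.0, 850.0]

h0_hat = fit_slope(D, V)
residuals = [v - h0_hat * d for d, v in zip(D, V)]
# Estimated noise level; divide by n - 1 because one parameter was fitted.
sigma_hat = math.sqrt(sum(r * r for r in residuals) / (len(D) - 1))

rng = random.Random(1)
sim_slopes = []
for _ in range(4000):
    # Simulate new velocities from the fitted line plus Gaussian noise, then refit.
    V_sim = [h0_hat * d + rng.gauss(0.0, sigma_hat) for d in D]
    sim_slopes.append(fit_slope(D, V_sim))

sim_sd = statistics.stdev(sim_slopes)
# Under the same assumptions the spread of the slope is sigma / sqrt(sum of D_i^2).
analytic_sd = sigma_hat / math.sqrt(sum(d * d for d in D))

print(f"h0 estimate              = {h0_hat:.1f}")
print(f"simulated spread of h0   = {sim_sd:.1f}")
print(f"formula sigma/sqrt(sum D^2) = {analytic_sd:.1f}")
```

The single fitted number gets a spread only because the data are modelled as one draw from a distribution of possible data sets.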

A different method is to make the assumptions that allow the use of a "linear asymptotic" estimator for ##h_0##. I don't know if pure frequentist reasoning can be used to justify that technique.

Orodruin · Jun 16, 2019

Stephen Tashi said:

A difficulty in converting this example to a Bayesian confidence interval example is that, as an example of frequentist reasoning, it is complicated.

Another is that Bayesian statistics produce credible intervals, not confidence intervals, which are inherently frequentist.

FactChecker · Jun 16, 2019

Orodruin said:

Another is that Bayesian statistics produce credible intervals, not confidence intervals, which are inherently frequentist.

At least as credible as the prior distribution, as modified by some amount of data, can be. A wrong prior can require a great amount of data to correct.
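
A small numerical illustration of that point, reusing the coin example from earlier in the thread (12 successes in 30 trials). The strongly informative Beta(90, 10) prior is a made-up choice for this sketch.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of the success rate for a Beta prior and Bernoulli data."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

# Uniform prior Beta(1, 1) with 12 successes and 18 failures.
mean_uniform = posterior_mean(1, 1, 12, 18)         # 13/32

# Hypothetical strong prior centred near 0.9, same 30 trials.
mean_wrong_prior = posterior_mean(90, 10, 12, 18)   # 102/130

# Same strong prior, but 100 times more data at the same success rate.
mean_more_data = posterior_mean(90, 10, 1200, 1800) # 1290/3100

print(f"uniform prior, 30 trials:     {mean_uniform:.3f}")
print(f"strong prior, 30 trials:      {mean_wrong_prior:.3f}")
print(f"strong prior, 3000 trials:    {mean_more_data:.3f}")
```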

Orodruin · Jun 16, 2019

FactChecker said:

At least as credible as the prior distribution, as modified by some amount of data, can be. A wrong prior can require a great amount of data to correct.

Don’t blame me, I did not invent the nomenclature ...

Buzz Bloom · Jun 16, 2019

Stephen Tashi said:

How do we attribute a standard deviation to a single number?

Stephen Tashi said:

The conceptually simplest method is to use estimated statistical properties of the data to generate simulated data and use the varying values of h₀ obtained by fits to the simulated data to estimate the variability of h₀.

Hi Stephen:

The method I have in mind shares some of the ideas you suggest for generating simulated data. Specifically, the method repeats the following a relatively large number of times.
1. Randomly select a fraction of the data points, say for example, one half.
2. Find the value of h₀ which corresponds to the least mean square fit.
Each of these h₀ values is a random variable, and the collection can be used to calculate a mean and standard deviation.

I would expect this mean to be close to the previously calculated value using all of the data. I may be mistaken, but I think it likely that the original value using all the data is a better estimate of the best value to use than the second value. It is in this sense that I used the term "mean" before the distribution was generated. One question I am not certain about is the best number of iterations/trials to use for the calculation of the standard deviation. My guess is that the same number as the original set of data points is a good choice. I would also add the square of the difference between the two mean values to the square of the calculated standard deviation.

I would welcome your opinion.

Regards,
Buzz

Stephen Tashi · Jun 16, 2019

Buzz Bloom said:

The method I have in mind shares some of the ideas you suggest for generating simulated data. Specifically, the method repeats the following a relatively large number of times.
1. Randomly select a fraction of the data points, say for example, one half.
2. Find the value of h₀ which corresponds to the least mean square fit.
Each of these h₀ values is a random variable, and the collection can be used to calculate a mean and standard deviation.

This approach is called "bootstrapping" and we can look up articles about it.
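
For reference, here is a minimal sketch of the usual nonparametric bootstrap applied to the slope fit. Note the assumption: the standard bootstrap draws as many points as the original data set, with replacement, which is not identical to the half-sample scheme in the quote. The data values are hypothetical.

```python
import random
import statistics

def fit_slope(distances, velocities):
    """Least-squares slope for a line through the origin."""
    return (sum(d * v for d, v in zip(distances, velocities))
            / sum(d * d for d in distances))

def bootstrap_slopes(distances, velocities, n_boot, seed=0):
    """Refit the slope on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(distances)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample (D, V) pairs
        slopes.append(fit_slope([distances[i] for i in idx],
                                [velocities[i] for i in idx]))
    return slopes

# Hypothetical data (Mpc, km/s), for illustration only.
D = [0.3, 0.5, 0.9, 1.1, 1.4, 1.7, 2.0]
V = [170.0, 290.0, 650.0, 450.0, 500.0, 960.0, 850.0]

h0_hat = fit_slope(D, V)
slopes = bootstrap_slopes(D, V, n_boot=2000)
boot_mean = statistics.mean(slopes)
boot_sd = statistics.stdev(slopes)

print(f"fit to all data:  {h0_hat:.1f}")
print(f"bootstrap mean:   {boot_mean:.1f}")
print(f"bootstrap spread: {boot_sd:.1f}")
```

The number of resamples only controls how precisely the spread is estimated; it is usually taken much larger than the number of data points.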

I would expect this mean to be close to the previously calculated value using all of the data. I may be mistaken, but I think it likely that the original value using all the data is a better estimate of the best value to use than the second value. It is in this sense that I used the term "mean" before the distribution was generated. One question I am not certain about is the best number of iterations/trials to use for the calculation of the standard deviation. My guess is that the same number as the original set of data points is a good choice. I would also add the square of the difference between the two mean values to the square of the calculated standard deviation.

The question of which estimators are better or best depends on the technical definition of "best". Among the possible interpretations of "best" are: minimum variance, unbiased, maximum likelihood, and best mean square. I don't know which, if any, of those criteria are met by the proposed bootstrap estimator. We can probably find an article about it on the web.

Dale · Jun 16, 2019

Buzz Bloom said:

How would a calculation be made using Bayesian methods leading to a value for a mean and standard deviation for a distribution of possible h0 values? What would the conditional probabilities and priors be?

For a straight linear regression there is a conjugate prior, so you can do this analytically. It is described here:
https://en.m.wikipedia.org/wiki/Bayesian_linear_regression
The conjugate prior is normal for the regression coefficients and inverse-gamma for the error variance.
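
As a simplified sketch of the analytic route, suppose the error standard deviation is treated as known, so that only the normal prior on the slope is needed (the full conjugate treatment also puts the inverse-gamma prior on the variance). With model V_i = h₀ D_i + noise, the posterior for h₀ is again normal. All numbers below, including the data, the noise level and the prior, are hypothetical.

```python
import math

def posterior_h0(distances, velocities, sigma, prior_mean, prior_sd):
    """Normal posterior (mean, sd) for the slope with known noise sd sigma."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = sum(d * d for d in distances) / sigma**2
    post_prec = prior_prec + data_prec
    weighted = (prior_mean * prior_prec
                + sum(d * v for d, v in zip(distances, velocities)) / sigma**2)
    return weighted / post_prec, math.sqrt(1.0 / post_prec)

# Hypothetical data (Mpc, km/s), for illustration only.
D = [0.3, 0.5, 0.9, 1.1, 1.4, 1.7, 2.0]
V = [170.0, 290.0, 650.0, 450.0, 500.0, 960.0, 850.0]

# Assumed noise level and an assumed prior h0 ~ Normal(500, 300^2).
post_mean, post_sd = posterior_h0(D, V, sigma=150.0, prior_mean=500.0, prior_sd=300.0)

# Central 95% credible interval of the normal posterior.
cred_low = post_mean - 1.96 * post_sd
cred_high = post_mean + 1.96 * post_sd

print(f"posterior mean = {post_mean:.1f}, posterior sd = {post_sd:.1f}")
print(f"95% credible interval = [{cred_low:.1f}, {cred_high:.1f}]")
```

With a very wide prior the posterior mean approaches the least-squares slope, and the posterior standard deviation approaches sigma divided by the square root of ∑ D_i².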

In practice, however, I think most people just use Markov chain Monte Carlo methods. That allows more flexibility in specifying both the prior and the model. So usually you can just specify a prior that makes sense, plug it and the data into the MCMC package of your choice and get your posterior distribution.
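
To show what such a package does internally, here is a bare-bones random-walk Metropolis sampler written with the Python standard library. It is a toy sketch, not a substitute for a real MCMC package: the data, the known noise level, the normal prior, the step size and the burn-in length are all assumptions chosen for illustration.

```python
import math
import random
import statistics

# Hypothetical data (Mpc, km/s), for illustration only.
D = [0.3, 0.5, 0.9, 1.1, 1.4, 1.7, 2.0]
V = [170.0, 290.0, 650.0, 450.0, 500.0, 960.0, 850.0]
SIGMA = 150.0                           # assumed known noise sd in velocity
PRIOR_MEAN, PRIOR_SD = 500.0, 300.0     # assumed normal prior on h0

def log_posterior(h0):
    """Log of prior times likelihood, up to an additive constant."""
    log_prior = -0.5 * ((h0 - PRIOR_MEAN) / PRIOR_SD) ** 2
    log_like = -0.5 * sum(((v - h0 * d) / SIGMA) ** 2 for d, v in zip(D, V))
    return log_prior + log_like

def metropolis(n_steps, step_sd, start, seed=0):
    """Random-walk Metropolis chain targeting the posterior of h0."""
    rng = random.Random(seed)
    x, lp = start, log_posterior(start)
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_sd)
        lp_prop = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = proposal, lp_prop
        chain.append(x)
    return chain

chain = metropolis(n_steps=20000, step_sd=50.0, start=PRIOR_MEAN)
samples = chain[2000:]                  # discard burn-in
mcmc_mean = statistics.mean(samples)
mcmc_sd = statistics.stdev(samples)
ordered = sorted(samples)
cred_low = ordered[int(0.025 * len(ordered))]
cred_high = ordered[int(0.975 * len(ordered))]

print(f"posterior mean ~ {mcmc_mean:.1f}, posterior sd ~ {mcmc_sd:.1f}")
print(f"95% credible interval ~ [{cred_low:.1f}, {cred_high:.1f}]")
```

The mean, standard deviation and interval are read off directly from the samples, which is what makes the approach flexible when the prior or model has no analytic posterior.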

As far as how to specify the prior, that requires domain knowledge that I don’t have for this problem. If there were prior studies you might use those to generate a mean and a standard deviation and then use a prior with say double that standard deviation. Or if there haven’t been previous studies but some values are absurd then you could shape your prior accordingly. The point is that the prior should summarize all of the currently available information.
