Computing uncertainties in histogram bin counts

SUMMARY

This discussion focuses on computing uncertainties in histogram bin counts derived from redshift values of quasars, specifically addressing how the uncertainties in the redshift estimates propagate to the bin counts. The conversation highlights the challenges of dealing with ill-posed problems in statistics, particularly when the mean of the underlying distribution is unknown. Two approaches are proposed: a simplistic method that treats each measurement as the mean of its distribution, and a Bayesian approach that places a prior distribution on the unknown mean. The importance of understanding the distribution of the bin counts, especially for small counts, is emphasized, along with the need for rigorous proofs regarding standard deviations and confidence intervals.

PREREQUISITES
  • Understanding of Gaussian distributions and their properties
  • Familiarity with Bayesian statistics and prior distributions
  • Knowledge of binomial distributions and their convolution
  • Basic concepts of statistical estimation and variance calculation
NEXT STEPS
  • Research "Bayesian statistics for estimating parameters" to deepen understanding of prior distributions.
  • Explore "Confidence intervals for binomial distributions" to learn methods for small bin counts.
  • Study "Convolution of distributions" to understand how bin counts can be modeled.
  • Investigate "Statistical properties of histogram bin counts" for insights into variance and distribution behavior.
USEFUL FOR

Astrophysicists, statisticians, and data analysts working with redshift data and histogram analysis, particularly those interested in uncertainty quantification and statistical modeling.

Sunil Simha
I am working on astrophysical data and I have a large number of redshift values of quasars. Each redshift estimate naturally comes with an estimated standard error. If I plot a histogram of these redshifts, I would expect the bin counts to also have some sort of uncertainty.

I am unable to figure out how the uncertainties in the redshift estimates propagate to the bin counts. I also can't seem to phrase the problem properly enough to find material online that addresses it. Any suggestions as to how to go about computing the uncertainties?

It would also be a great help if I could get some references.
 
I'll assume by "uncertainty" you refer to the standard deviation of a random variable. Then the work begins with defining that random variable.

If we consider the count ##C## in the ##j##-th bin to be a random variable, then we have an ill-posed problem. You know the standard deviation of the measurement ##G_i## from the ##i##-th galaxy, but you don't know the correct measurement for that galaxy, which is presumably the mean of the random variable ##G_i## that was sampled to produce the measurement. Not knowing the mean of ##G_i##, you can't calculate the probability that the realized value of ##G_i## falls in a particular bin (i.e., you can't calculate the probability that ##G_i## is in the specific interval ##[a_j, b_j]## that defines the ##j##-th bin).

(As a simple analogy, consider this problem: Given that G has a normal distribution with standard deviation 1.3 and unknown mean, find the probability that a random sample from G has a value between 5 and 6.)

Ill-posed problems never stop statisticians who are determined to apply statistics. They add assumptions till the problem is better-posed. (A good example of this is the assumptions used in the problem of estimating the standard deviations of parameters that are produced by curve-fitting methods, such as regression.)

The simplistic way to make the problem well-posed is to say some words to the effect that you will assume the particular measurement ##g_i## of the random variable ##G_i## can be used as the mean of the distribution of ##G_i##. Then, assuming ##G_i## has a gaussian distribution, you can compute the probability that a random sample from ##G_i## lies in a particular interval.
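For instance, here is a minimal sketch of that plug-in calculation in Python (the measurement, standard error, and bin edges are invented for illustration):

```python
# Plug-in approach: treat the observed g_i as the mean of a Gaussian whose
# standard deviation is the reported standard error, then ask how much of
# that Gaussian's mass falls in the bin [a, b).
from scipy.stats import norm

def bin_probability(g_i, sigma_i, a, b):
    """Pr(a <= G_i < b), assuming G_i ~ Normal(mean=g_i, sd=sigma_i)."""
    return norm.cdf(b, loc=g_i, scale=sigma_i) - norm.cdf(a, loc=g_i, scale=sigma_i)

# Example: a redshift measured as 1.52 +/- 0.03 and the bin [1.5, 1.6)
print(bin_probability(1.52, 0.03, 1.5, 1.6))  # ~0.74; the rest leaks into neighboring bins
```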

A more complicated way to make the problem well-posed is to take a Bayesian approach and assume some prior distribution for the mean of ##G_i##. After some possibly not-simple calculations, this approach also allows you to calculate the probability that a random sample from ##G_i## lies in a particular interval.
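One concrete version of this, chosen only because it keeps everything closed-form, is a conjugate Gaussian prior on the unknown mean. The prior parameters ##\mu_0## and ##\tau## below are placeholders you would have to pick yourself:

```python
# Conjugate-Gaussian sketch: mean_i ~ Normal(mu0, tau) a priori, and the
# measurement G_i ~ Normal(mean_i, sigma_i). After observing g_i, the
# posterior predictive for a fresh sample of G_i is again Gaussian.
from scipy.stats import norm

def bayesian_bin_probability(g_i, sigma_i, a, b, mu0, tau):
    # Standard conjugate update for the unknown mean
    v_post = 1.0 / (1.0 / tau**2 + 1.0 / sigma_i**2)
    m_post = v_post * (mu0 / tau**2 + g_i / sigma_i**2)
    # Predictive sd combines measurement noise with remaining mean uncertainty
    s_pred = (sigma_i**2 + v_post) ** 0.5
    return norm.cdf(b, loc=m_post, scale=s_pred) - norm.cdf(a, loc=m_post, scale=s_pred)

# With a very broad prior (tau >> sigma_i) this approaches the plug-in answer,
# except the predictive sd is inflated to roughly sqrt(2) * sigma_i.
print(bayesian_bin_probability(1.52, 0.03, 1.5, 1.6, mu0=1.5, tau=10.0))
```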

I prefer the Bayesian approach if the result of the analysis is to be used for some important decision. If the result of the analysis is just to fill in a blank in a report, then I'd use the simple approach.

Consider a particular bin whose endpoints are ##[a_j, b_j]##. Define the random variable ##c_i## to be 1 if the measurement of ##G_i## falls in the bin and ##c_i = 0## otherwise. So ##c_i## is a Bernoulli random variable with distribution ##Pr(c_i = 1) = p_i,\ Pr(c_i = 0) = 1 - p_i##, and we have made some assumptions that let us compute the numerical value of ##p_i##.

The variance of ##c_i## is ##p_i(1-p_i)##. The variance of the total count is the variance of ##C = \sum_{i=1}^N c_i## where the sum is taken over the ##N## measurements.

Assuming the measurements are independent, the variance of ##C## is the sum of the variances of the ##c_i##. The standard deviation of ##C## is the square root of that variance.
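Putting those pieces together as a sketch (again in Python, with fabricated data standing in for real redshifts):

```python
# For each measurement i and bin j, compute p_ij under the plug-in Gaussian
# model, then E[C_j] = sum_i p_ij and Var(C_j) = sum_i p_ij * (1 - p_ij).
import numpy as np
from scipy.stats import norm

def bin_count_uncertainties(g, sigma, edges):
    """Expected counts and standard deviations for histogram bins."""
    g = np.asarray(g)[:, None]           # shape (N, 1)
    sigma = np.asarray(sigma)[:, None]
    edges = np.asarray(edges)
    cdf = norm.cdf(edges[None, :], loc=g, scale=sigma)  # shape (N, n_bins + 1)
    p = np.diff(cdf, axis=1)             # p[i, j] = Pr(measurement i in bin j)
    expected = p.sum(axis=0)
    variance = (p * (1.0 - p)).sum(axis=0)   # independence across measurements
    return expected, np.sqrt(variance)

rng = np.random.default_rng(0)
z = rng.uniform(0.5, 3.0, size=500)      # stand-in redshifts
se = np.full_like(z, 0.05)               # stand-in standard errors
counts, sds = bin_count_uncertainties(z, se, np.linspace(0.5, 3.0, 11))
print(counts.round(1), sds.round(2))
```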

( There is technically a difference between "calculating" a (population) variance and "estimating" that variance from sample data. Picking the right word to use in a problem that involves lots of assumptions is difficult when some of the assumptions involve "estimating" something and then using those estimates as if they were the actual values. )
 
Thanks for the answer, Stephen. There are a couple of other points that are nagging me.
  1. In estimating the variance of the bin counts, the constraint that the sum of all bin counts is constant never appears (or does it? Did I miss something?). How would the estimate of the variance change if we were to explicitly take that into consideration (assuming you haven't already)?
  2. Assuming that the data points have a gaussian distribution (mean = value of redshift, std.dev = standard error in estimation), I can see intuitively that the std. dev. estimated for the counts will be less than the magnitude of the bin counts. This is required for consistency because if the std. dev. were greater than the bin count, then we're saying one of the bins could have negative count (negative values would be encompassed within 1 ##\sigma## intervals). Is there a rigorous proof for this?
  3. The probability distribution of the bin count for a given bin would obviously be skewed because negative values are not allowed. I suppose one would expect the distribution to tend to a Gaussian for large counts. However, for small values, this skewness would become more evident. Is there a way to estimate confidence intervals for small bin counts? Is it known what kind of distribution the bin counts follow?
 
Sunil Simha said:
  1. In estimating the variance of the bin counts, the constraint that the sum of all bin counts is constant never appears (or does it? Did I miss something?). How would the estimate of the variance change if we were to explicitly take that into consideration (assuming you haven't already)?
For example, suppose we had only two bins and they are represented by random variables ##C_1, C_2## then the bins have a joint probability distribution ##J(C_1,C_2)## and, as you observed, the two random variables ##C_1, C_2## are not independent because we assume each given measurement ##G_i## is constrained to fall in one of the two bins. However, the phrase "The standard deviation of ##C_1##" is only meaningful when we are talking about the distribution of ##C_1## alone, i.e. the marginal distribution of ##C_1##. The marginal distribution of ##C_1## only involves assigning a probability to the events "Measurement ##i## falls in ##C_1##". The marginal distribution of ##C_1## doesn't depend on where measurements that fall outside ##C_1## land.

If you are taking the standard deviations of the ##G_i## as given numbers, then the constraint on the total counts doesn't appear because we are dealing only with the marginal distributions of the counts. However, as I mentioned, there is a distinction between being given a parameter and estimating a parameter. If we were estimating the standard deviations of the ##G_i## from a set of several measurements of the same ##i##-th galaxy, then the constraint might be needed because we would be estimating the ##p_{i,j}## for count ##C_j## instead of "calculating" the ##p_{i,j}## from given information.

"Estimation" is a complicated topic, and joint estimation is particularly complicated. The first consideration that arises in estimation is which of the several criteria for a "good" estimator we shall use. This is usually a subjective decision. (For example, there is no axiom in statistics that says you must use the sample standard deviation as the estimate of the population standard deviation.)
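A small simulation, with purely illustrative numbers, makes the marginal-versus-joint distinction concrete. In the two-bin case, ##C_1 + C_2 = N## is fixed, so the counts are perfectly anticorrelated, yet the marginal variance of ##C_1## still matches ##\sum_i p_{i,1}(1 - p_{i,1})##:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 200
true_means = rng.uniform(0.0, 2.0, size=N)   # stand-ins for the true values
sigma = 0.3
split = 1.0                                  # bin 1 = (-inf, 1), bin 2 = [1, inf)

p1 = norm.cdf(split, loc=true_means, scale=sigma)  # Pr(measurement i in bin 1)

draws = rng.normal(true_means, sigma, size=(20_000, N))
C1 = (draws < split).sum(axis=1)
C2 = N - C1                                  # the constraint, built in

print(C1.var(), (p1 * (1 - p1)).sum())  # nearly equal: marginal variance unchanged
print(np.cov(C1, C2)[0, 1])             # about -Var(C1): the joint effect
```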

Sunil Simha said:
  2. Assuming that the data points have a gaussian distribution (mean = value of redshift, std.dev = standard error in estimation), I can see intuitively that the std. dev. estimated for the counts will be less than the magnitude of the bin counts. This is required for consistency because if the std. dev. were greater than the bin count, then we're saying one of the bins could have negative count (negative values would be encompassed within 1 ##\sigma## intervals). Is there a rigorous proof for this?
Theoretically, even if the standard deviation of ##G_i## is smaller than a bin half-width and the mean of ##G_i## is in the center of a bin, there is still a probability that an observed value ##g## from a gaussian distribution might be arbitrarily large or small and hence fall outside any given bin. Using a gaussian distribution is an approximation. The purely mathematical issue is whether a gaussian is an "adequate" approximation. Writing a rigorous proof of "adequacy" would require that we precisely define what "adequacy" means in the context of the problem.
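To put rough numbers on that tail leakage: with the mean at the center of a bin, a gaussian whose standard deviation equals the bin half-width leaves about 32% of its mass outside the bin, and about 5% at half that standard deviation (illustrative half-width below):

```python
from scipy.stats import norm

half_width = 0.05   # illustrative bin half-width
for sd in (half_width, half_width / 2, half_width / 4):
    leak = 2 * norm.cdf(-half_width / sd)   # mass outside a centered bin
    print(sd, round(leak, 5))               # ~0.31731, ~0.0455, ~0.00006
```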

It's a question of science (e.g. other studies of redshift measurements of a given galaxy) how the random variable ##G_i## is distributed. For example, is a lognormal a more plausible distribution for ##G_i## than a normal distribution?

Sunil Simha said:
  3. The probability distribution of the bin count for a given bin would obviously be skewed because negative values are not allowed. I suppose one would expect the distribution to tend to a Gaussian for large counts. However, for small values, this skewness would become more evident. Is there a way to estimate confidence intervals for small bin counts? Is it known what kind of distribution the bin counts follow?

By "skewed" do you mean "asymmetric about the mean"? According the model I gave, each bin count ##C_j## is a convolution of binomial distributions. This is a discrete distribution and may be asymmetric about its mean. We can look up how to do confidence intervals for convolutions of binomial distributions. (There are probably several different ways. There is no axiom that says a confidence interval must be based on a symmetric distribution of outcomes.)

It is the redshift measurements ##G_i## that have a continuous distribution. (I assumed ##G_i## has a normal distribution, but you could assume a different type of continuous distribution. All that is needed is that the assumed distribution of ##G_i## allows you to compute the probability ##p_{i,j}## that the ##i##-th sample, which is the sample taken from ##G_i##, falls in the ##j##-th bin.)
 
