Computing uncertainties in histogram bin counts

1. Dec 13, 2016

Sunil Simha

I am working with astrophysical data and have a large number of quasar redshift values. Each redshift estimate comes with its own estimated standard error. If I plot a histogram of these redshifts, I would expect the bin counts to also have some sort of uncertainty.

I am unable to figure out how the uncertainties in the redshift estimates propagate to the bin counts. I also seem to be unable to phrase this properly and get material online that addresses this problem. Any suggestions as to how to go about computing the uncertainties?

It would also be a great help if I could get some references.

2. Dec 13, 2016

Stephen Tashi

I'll assume by "uncertainty" you refer to the standard deviation of a random variable. Then the work begins with defining that random variable.

If we consider the count $C$ in the $j$-th bin to be a random variable, then you have an ill-posed problem. You know the standard deviation of the measurement $G_i$ from the $i$-th galaxy but you don't know the correct measurement for that galaxy, which is presumably the mean of the random variable $G_i$ that was sampled to produce the measurement. Not knowing the mean of $G_i$, you can't calculate the probability that the realized value of $G_i$ falls in a particular bin (i.e., you can't calculate the probability that $G_i$ is in the specific interval $[a_j, b_j]$ that defines the $j$-th bin).

(As a simple analogy, consider this problem: Given that G has a normal distribution with standard deviation 1.3 and unknown mean, find the probability that a random sample from G has a value between 5 and 6.)

Ill-posed problems never stop statisticians who are determined to apply statistics. They add assumptions till the problem is better-posed. (A good example of this is the assumptions used in the problem of estimating the standard deviations of parameters that are produced by curve-fitting methods, such as regression.)

The simplistic way to make the problem well-posed is to say some words to the effect that you will assume the particular measurement $g_i$ of the random variable $G_i$ can be used as the mean of the distribution of $G_i$. Then, assuming $G_i$ has a gaussian distribution, you can compute the probability that a random sample from $G_i$ lies in a particular interval.
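As a sketch of that calculation, write $\sigma_i$ for the reported standard error of the $i$-th measurement and $\Phi$ for the standard normal cumulative distribution function (both symbols are introduced here for illustration). Under the gaussian assumption with mean $g_i$, the probability of landing in the bin $[a_j, b_j]$ would be

$$p_i = \Phi\!\left(\frac{b_j - g_i}{\sigma_i}\right) - \Phi\!\left(\frac{a_j - g_i}{\sigma_i}\right).$$

A measurement sitting many $\sigma_i$ inside a bin gives $p_i \approx 1$ for that bin, while one sitting exactly on a bin edge gives $p_i \approx 1/2$ for each neighbouring bin, provided the bins are wide compared with $\sigma_i$.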

A more complicated way to make the problem well-posed is to take a Bayesian approach and assume some prior distribution for the mean of $G_i$. After some possibly not-simple calculations, that approach also allows you to calculate the probability that a random sample from $G_i$ lies in a particular interval.

I prefer the Bayesian approach if the result of the analysis is to be used for some important decision. If the result of the analysis is just to fill in a blank in a report, then I'd use the simple approach.

Consider a particular bin, whose endpoints are $[a_j, b_j]$. Define the random variable $c_i$ to be 1 if the measurement of $G_i$ falls in the bin and $c_i = 0$ otherwise. So $c_i$ is a Bernoulli random variable with distribution $Pr(c_i = 1) = p_i,\ Pr(c_i = 0) = 1 - p_i$, and we have made some assumptions that let us compute the numerical value of $p_i$.

The variance of $c_i$ is $p_i(1-p_i)$. The variance of the total count is the variance of $C = \sum_{i=1}^N c_i$ where the sum is taken over the $N$ measurements.

Assuming the measurements are independent, the variance of $C$ is the sum of the variances of the $c_i$. The standard deviation of $C$ is the square root of that variance.
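The recipe above can be sketched in a few lines of Python. This is only an illustration under the simple assumption (each measured value is used as the mean of a gaussian whose standard deviation is the reported standard error). The function names `normal_cdf` and `bin_count_moments` and the sample numbers are hypothetical, not taken from any real catalogue.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a gaussian with the given mean and sigma."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def bin_count_moments(values, sigmas, edges):
    """Return (expected_counts, std_devs), one entry per bin.

    Assumption: measurement i is gaussian with mean values[i] and
    standard deviation sigmas[i], and the measurements are independent.
    """
    n_bins = len(edges) - 1
    expected = [0.0] * n_bins
    variance = [0.0] * n_bins
    for g, s in zip(values, sigmas):
        cdf = [normal_cdf(e, g, s) for e in edges]
        for j in range(n_bins):
            p = cdf[j + 1] - cdf[j]        # probability that measurement i lands in bin j
            expected[j] += p               # mean of a Bernoulli variable
            variance[j] += p * (1.0 - p)   # variance of a Bernoulli variable
    return expected, [math.sqrt(v) for v in variance]

# Hypothetical redshifts, standard errors, and bin edges, for illustration only.
redshifts = [0.52, 0.98, 1.03, 1.47, 2.10]
errors = [0.05, 0.04, 0.06, 0.05, 0.10]
edges = [0.0, 1.0, 2.0, 3.0]
expected, std = bin_count_moments(redshifts, errors, edges)
```

Because $p_i(1-p_i) \le p_i$, the variance computed this way can never exceed the expected count of the bin, so the standard deviation is at most the square root of the expected count.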

( There is technically a difference between "calculating" a (population) variance and "estimating" that variance from sample data. Picking the right word to use in a problem that involves lots of assumptions is difficult when some of the assumptions involve "estimating" something and then using those estimates as if they were the actual values. )

3. Dec 13, 2016

Sunil Simha

Thanks for the answer Stephen. There are a couple of other points that are nagging me.
1. In estimating the variance of the bin counts, the constraint that the sum of all bin counts is constant never appears (or does it? Did I miss something?). How would the estimate of the variance change if we were to explicitly take that into consideration (assuming you haven't already)?
2. Assuming that the data points have a gaussian distribution (mean = value of redshift, std.dev = standard error in estimation), I can see intuitively that the std. dev. estimated for the counts will be less than the magnitude of the bin counts. This is required for consistency because if the std. dev. were greater than the bin count, then we're saying one of the bins could have negative count (negative values would be encompassed within 1 $\sigma$ intervals). Is there a rigorous proof for this?
3. The probability distribution of the bin count for a given bin would obviously be skewed because negative values are not allowed. I suppose one would expect the distribution to tend to a Gaussian for large counts. However, for small values, this skewness would become more evident. Is there a way to estimate confidence intervals for small bin counts? Is it known what kind of distribution the bin counts follow?

4. Dec 14, 2016

Stephen Tashi

For example, suppose we had only two bins and they are represented by random variables $C_1, C_2$ then the bins have a joint probability distribution $J(C_1,C_2)$ and, as you observed, the two random variables $C_1, C_2$ are not independent because we assume each given measurement $G_i$ is constrained to fall in one of the two bins. However, the phrase "The standard deviation of $C_1$" is only meaningful when we are talking about the distribution of $C_1$ alone, i.e. the marginal distribution of $C_1$. The marginal distribution of $C_1$ only involves assigning a probability to the events "Measurement $i$ falls in $C_1$". The marginal distribution of $C_1$ doesn't depend on where measurements that fall outside $C_1$ land.
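To make the role of the constraint concrete, here is a short worked formula. Write $p_{i,j}$ for the probability that measurement $i$ lands in bin $j$ (notation assumed here for illustration) and keep the assumption that the measurements are independent. A single measurement cannot land in two bins at once, so the indicators for bins $j$ and $k$ of the same measurement have covariance $-p_{i,j}\,p_{i,k}$, which gives

$$\operatorname{Var}(C_j) = \sum_i p_{i,j}\,(1 - p_{i,j}), \qquad \operatorname{Cov}(C_j, C_k) = -\sum_i p_{i,j}\,p_{i,k} \quad (j \ne k).$$

The constraint shows up only in the negative covariances between different bins. With exactly two bins that absorb all the probability, $p_{i,2} = 1 - p_{i,1}$, so $\operatorname{Cov}(C_1, C_2) = -\operatorname{Var}(C_1)$ and the variance of $C_1 + C_2$ is zero, as it must be for a fixed total.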

If you are taking the standard deviations of the $G_i$ as given numbers, then the constraint on the total counts doesn't appear because we are dealing only with the marginal distributions of the counts. However, as I mentioned, there is a distinction between being given a parameter and estimating a parameter. If we were estimating the standard deviations of the $G_i$ from a set of several measurements of the same $i$-th galaxy, then the constraint might be needed because we would be estimating the $p_{i,j}$ for count $C_j$ instead of "calculating" the $p_{i,j}$ from given information. "Estimation" is a complicated topic and joint estimation is particularly complicated. The first consideration that arises in estimation is which of the several criteria for a "good" estimator we shall use. This is usually a subjective decision. (For example, there is no axiom in statistics that says you must use the sample standard deviation as the estimate of the population standard deviation.)

Theoretically, even if the standard deviation of $G_1$ is smaller than a bin half-width and the mean of $G_1$ is in the center of a bin, there is still a probability that an observed value $g$ from a gaussian distribution might be arbitrarily large or small and hence fall outside any given bin. Using a gaussian distribution is an approximation. The purely mathematical issue is whether a gaussian is an "adequate" approximation. Writing a rigorous proof of "adequacy" would require that we precisely define what "adequacy" means in the context of the problem.

It's a question of science (e.g. other studies of redshift measurements of a given galaxy) how the random variable $G_i$ is distributed. For example, is a lognormal a more plausible distribution for $G_i$ than a normal distribution?

By "skewed" do you mean "asymmetric about the mean"? According to the model I gave, the distribution of each bin count $C_j$ is a convolution of Bernoulli (single-trial binomial) distributions with different success probabilities $p_{i,j}$, sometimes called a Poisson binomial distribution. This is a discrete distribution and may be asymmetric about its mean. We can look up how to do confidence intervals for convolutions of binomial distributions. (There are probably several different ways. There is no axiom that says a confidence interval must be based on a symmetric distribution of outcomes.)
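For small counts the convolution can be computed exactly rather than approximated by a gaussian. The sketch below builds the probability mass function of the count by adding one measurement at a time, then reads off an equal-tailed interval. The names `count_pmf` and `central_interval` are hypothetical, and the interval is a probability interval for the count under the assumed model, which is only one of several possible conventions.

```python
def count_pmf(probs):
    """Exact distribution of a sum of independent Bernoulli variables.

    probs[i] is the probability that measurement i falls in the bin.
    Returns a list pmf with pmf[k] = Pr(count == k).
    """
    pmf = [1.0]  # with no measurements, the count is 0 with certainty
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1.0 - p)   # this measurement misses the bin
            new[k + 1] += mass * p       # this measurement lands in the bin
        pmf = new
    return pmf

def central_interval(pmf, level=0.68):
    """Equal-tailed interval: the counts strictly below lo, and strictly
    above hi, each carry at most (1 - level) / 2 of the probability."""
    tail = (1.0 - level) / 2.0
    lo, hi = 0, len(pmf) - 1
    cum = 0.0
    for k, mass in enumerate(pmf):
        cum += mass
        if cum > tail:
            lo = k
            break
    cum = 0.0
    for k in range(len(pmf) - 1, -1, -1):
        cum += pmf[k]
        if cum > tail:
            hi = k
            break
    return lo, hi

# Hypothetical probabilities for three measurements near one bin.
pmf = count_pmf([0.9, 0.2, 0.05])
interval = central_interval(pmf, level=0.68)
```

The interval is generally asymmetric about the mean when the expected count is small, and it can never extend below zero because the distribution has no mass there.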

It is the redshift measurements $G_i$ that have a continuous distribution. (I assumed $G_i$ has a normal distribution, but you could assume a different type of continuous distribution. All that is needed is that the assumed distribution of $G_i$ allows you to compute the probability $p_{i,j}$ that the $i$-th sample, which is the sample taken from $G_i$, falls in the $j$-th bin.)