Sunil Simha said:
- In estimating the variance of the bin counts, the constraint that the sum of all bin counts is constant never appears (or does it? Did I miss something?). How would the estimate of the variance change if we were to explicitly take that into consideration (assuming you haven't already)?
For example, suppose we had only two bins and they are represented by random variables ##C_1, C_2##. Then the bins have a joint probability distribution ##J(C_1,C_2)## and, as you observed, the two random variables ##C_1, C_2## are not independent because we assume each given measurement ##G_i## is constrained to fall in one of the two bins. However, the phrase "The standard deviation of ##C_1##" is only meaningful when we are talking about the distribution of ##C_1## alone, i.e. the marginal distribution of ##C_1##. The marginal distribution of ##C_1## only involves assigning a probability to the events "Measurement ##i## falls in ##C_1##". The marginal distribution of ##C_1## doesn't depend on where measurements that fall outside ##C_1## land.
If you are taking the standard deviations of the ##G_i## as given numbers, then the constraint on the total counts doesn't appear because we are dealing only with the marginal distributions of the counts. However, as I mentioned, there is a distinction between being given a parameter and estimating a parameter. If we were estimating the standard deviations of the ##G_i## from a set of several measurements of the same ##i##-th galaxy, then the constraint might be needed because we would be estimating the ##p_{i,j}## for count ##C_j## instead of "calculating" the ##p_{i,j}## from given information. "Estimation" is a complicated topic and joint estimation is particularly complicated. The first consideration that arises in estimation is which of the several criteria for a "good" estimator we shall use. This is usually a subjective decision. (For example, there is no axiom in statistics that says you must use the sample standard deviation as the estimate of the population standard deviation.)
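For the "given parameters" case, here is a minimal Python sketch of how the ##p_{i,j}## and the marginal mean and variance of each count could be calculated. It assumes each ##G_i## is normal with a given mean and standard deviation and that the measurements are independent. The function names, the bin edge and the numerical inputs are all hypothetical, chosen only for illustration.

```python
import math

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def bin_probabilities(mean, sd, edges):
    """p_{i,j}: probability that one measurement lands in bin [edges[j], edges[j+1])."""
    return [normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
            for lo, hi in zip(edges[:-1], edges[1:])]

def count_mean_and_variance(means, sds, edges):
    """Marginal mean and variance of each bin count C_j.

    Assumes the measurements are independent, so C_j is a sum of
    independent 0/1 indicators with success probabilities p_{i,j}.
    """
    n_bins = len(edges) - 1
    mean_c = [0.0] * n_bins
    var_c = [0.0] * n_bins
    for m, s in zip(means, sds):
        for j, p in enumerate(bin_probabilities(m, s, edges)):
            mean_c[j] += p                # E(C_j) = sum_i p_{i,j}
            var_c[j] += p * (1.0 - p)     # Var(C_j) = sum_i p_{i,j} (1 - p_{i,j})
    return mean_c, var_c

# Hypothetical inputs: four galaxies, two bins that cover the whole real line.
means = [0.12, 0.18, 0.31, 0.33]
sds = [0.02, 0.05, 0.03, 0.04]
edges = [-math.inf, 0.2, math.inf]

mean_c, var_c = count_mean_and_variance(means, sds, edges)
print(mean_c, var_c)
```

In this two-bin setup the expected counts add up to the number of galaxies, and the two variances come out equal because ##p_{i,2} = 1 - p_{i,1}##. Each variance is still computed from that bin's own ##p_{i,j}## only, which is the sense in which the constraint does not enter the marginal calculation.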
- 2. Assuming that the data points have a gaussian distribution (mean = value of redshift, std.dev = standard error in estimation), I can see intuitively that the std. dev. estimated for the counts will be less than the magnitude of the bin counts. This is required for consistency because if the std. dev. were greater than the bin count, then we're saying one of the bins could have negative count (negative values would be encompassed within 1 ##\sigma## intervals). Is there a rigorous proof for this?
Theoretically, even if the standard deviation of ##G_1## is smaller than a bin half-width and the mean of ##G_1## is in the center of a bin, there is still a probability that an observed value ##g## from a gaussian distribution might be arbitrarily large or small and hence fall outside any given bin. Using a gaussian distribution is an approximation. The purely mathematical issue involves whether a gaussian is an "adequate" approximation. Writing a rigorous proof of "adequacy" would require that we precisely define what "adequacy" means in the context of the problem.
It's a question of science (e.g. other studies of redshift measurements of a given galaxy) how the random variable ##G_i## is distributed. For example, is a lognormal a more plausible distribution for ##G_i## than a normal distribution?
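As a sketch of what can be said about the size of the standard deviation, assume each ##C_j## is a sum of independent indicator variables, the ##i##-th being 1 with probability ##p_{i,j}## (the independence is an assumption of this sketch). Then ##\operatorname{Var}(C_j) = \sum_i p_{i,j}(1 - p_{i,j}) \le \sum_i p_{i,j} = E(C_j)##, so ##\sigma_{C_j} \le \sqrt{E(C_j)}##. This bounds the standard deviation by the square root of the expected count, which is below the expected count itself only when ##E(C_j) \ge 1##. It is a statement about the expected count, not about any particular observed count.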
- 3. The probability distribution of the bin count for a given bin would obviously be skewed because negative values are not allowed. I suppose one would expect the distribution to tend to a Gaussian for large counts. However, for small values, this skewness would become more evident. Is there a way to estimate confidence intervals for small bin counts? Is it known what kind of distribution the bin counts follow?
By "skewed" do you mean "asymmetric about the mean"? According to the model I gave, each bin count ##C_j## is a sum of independent 0/1 variables, so its distribution is a convolution of binomial distributions (each with a single trial). This is often called a Poisson binomial distribution. It is a discrete distribution and may be asymmetric about its mean. We can look up how to do confidence intervals for convolutions of binomial distributions. (There are probably several different ways. There is no axiom that says a confidence interval must be based on a symmetric distribution of outcomes.)
It is the redshift measurements ##G_i## that have a continuous distribution. (I assumed ##G_i## has a normal distribution, but you could assume a different type of continuous distribution. All that is needed is that the assumed distribution of ##G_i## allows you to compute the probability ##p_{i,j}## that the ##i##-th sample, which is the sample taken from ##G_i##, falls in the ##j##-th bin.)
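To make the small-count case concrete, here is a minimal Python sketch that builds the exact distribution of one bin count by repeated convolution and then reads off a central probability interval. It assumes the ##p_{i,j}## for that bin are already known and the measurements are independent. The probabilities below and the function names are hypothetical, and the "equal tails" interval is only one of several possible choices.

```python
def count_pmf(probs):
    """PMF of C = sum of independent 0/1 indicators with the given probabilities.

    Built by convolving in one indicator at a time; pmf[k] = P(C = k).
    """
    pmf = [1.0]                         # with no measurements, C = 0 for certain
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, w in enumerate(pmf):
            nxt[k] += w * (1.0 - p)     # this measurement misses the bin
            nxt[k + 1] += w * p         # this measurement lands in the bin
        pmf = nxt
    return pmf

def central_interval(pmf, level=0.95):
    """Trim at most (1 - level) / 2 of probability from each tail; return (lo, hi)."""
    alpha = (1.0 - level) / 2.0
    lo, cum = 0, 0.0
    while cum + pmf[lo] <= alpha:
        cum += pmf[lo]
        lo += 1
    hi, cum = len(pmf) - 1, 0.0
    while cum + pmf[hi] <= alpha:
        cum += pmf[hi]
        hi -= 1
    return lo, hi

# Hypothetical p_{i,j} for one bin j and four galaxies.
probs = [0.9, 0.6, 0.1, 0.05]
pmf = count_pmf(probs)
mean = sum(k * w for k, w in enumerate(pmf))
print(pmf, mean, central_interval(pmf))
```

The resulting interval covers at least the requested probability and need not be symmetric about the mean. Strictly, it is a probability interval for the count under the assumed model, which may or may not be what is wanted from a "confidence interval" here.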