Confidence interval for estimated mean of (discrete) uniform distribution

  1. Jan 13, 2012 #1
    Say that there is a random variable X ~ U(a,b), where U is the discrete uniform distribution on the integers in the interval [a,b]. Draw n independent samples with the same (unknown) parameters a and b. From those samples it's possible to estimate the mean by taking the sample mean (sum the sample values and divide by n), but how would I calculate a confidence interval for that estimate of the mean?

    (Note: It's also possible to estimate the mean using the sample mid-range, but since the data gathering is done manually, any recording error that puts a value outside the interval [a,b] would permanently bias the mid-range estimate of the mean, regardless of how many samples are taken.)
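
    A minimal sketch of the point in the note above, assuming a hypothetical range [1,6] and a single hypothetical mis-recorded value of 60 (neither comes from the actual data): the sample mean is only nudged by the bad value, while the mid-range is dominated by it however many good samples are added.

```python
import random

def sample_mean(xs):
    """Arithmetic mean of the sample."""
    return sum(xs) / len(xs)

def mid_range(xs):
    """Midpoint between the smallest and largest sample value."""
    return (min(xs) + max(xs)) / 2

def demo(a=1, b=6, n=1000, bad_value=60, seed=0):
    """Compare both estimators with and without one recording error.

    a, b, n and bad_value are illustrative assumptions only.
    """
    rng = random.Random(seed)
    clean = [rng.randint(a, b) for _ in range(n)]  # randint includes both ends
    dirty = clean + [bad_value]                    # one value outside [a, b]
    return {
        "mean_clean": sample_mean(clean),
        "mean_dirty": sample_mean(dirty),
        "mid_clean": mid_range(clean),
        "mid_dirty": mid_range(dirty),
    }

if __name__ == "__main__":
    for name, value in demo().items():
        print(name, value)
```

    Increasing n shrinks the effect of the bad value on the sample mean, but the mid-range stays at (minimum + 60)/2.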

    For the data I currently use this for, the sample sizes are small, typically n≤10.

    Example data:

    Any solutions or guidance to how to approach this (and resources for relevant material) are highly appreciated.
  3. Jan 13, 2012 #2

    Stephen Tashi

    Science Advisor

    No offense intended, but can we make sure that you are using the term "confidence interval" in the usual technical sense of that phrase? Some people say "confidence interval" when they really mean other things, such as a "prediction interval" or perhaps a Bayesian version of a confidence interval. What do you mean by a "confidence interval"?
  4. Jan 13, 2012 #3
    None taken.

    I want to be able to give an interval within which I can with a given certainty (e.g. 95%) say that the mean of the distribution (not the sample) lies.
  5. Jan 13, 2012 #4

    Stephen Tashi

    Science Advisor

    Assuming that I interpret "certainty" to mean "probability", do you want to be able to state a specific numerical interval based on an observed value taken from your data? - something like "There is a 95% probability that the mean of the distribution is between 1.8 and 3.8"?

    If you want that, you are asking for a "credible interval" - at least according to the current Wikipedia article on "confidence interval". This is a natural thing to want, but you have to make enough assumptions to employ Bayesian statistics in order to get it.

    In many fields of study, published papers use "confidence interval" by tradition. So if you are writing some kind of report, there is that consideration. We can discuss either approach, but "confidence intervals" do not have the same interpretation as "credible intervals".
  6. Jan 13, 2012 #5
    This is strictly for my own uses, so tradition is not an important aspect.

    I read briefly in the articles on credible intervals and confidence intervals and just want to paste in the following (if for nothing else, so that I can access it easier in this thread):

    With that said, what I was asking for in the first post is, just as you said, the credible interval. However, upon reading this I realize that my question is partially based on the misconception that a confidence interval of an estimate of a parameter would contain the parameter with a probability given by the confidence level. I'm not sure if I misremember the one statistics class I've taken or if it was taught erroneously (or if it was just for some subset of possible problems where the two are the same).

    If possible, I would very much like to be able to give a credible interval; presuming I understand the idea correctly, it is what feels most natural to me to describe the certainty of an estimation if there is only a single experiment done. However, the Wiki article says that
    and as far as I can tell, I have no certain prior distribution to work with. How would one arrive at a prior distribution? Could you select any prior distribution and work with it, with the only caveat that the posterior distribution would vary depending on how you chose your prior?

    If it is not possible, or if it proves to be too complex for me to learn right now, then a confidence interval would work.

    But, just to see if I understand the distinction between the two, if I used a confidence interval instead of a credible interval:
    1. I could not say that with a confidence level of certainty the true value of the estimated parameter is within the confidence interval (e.g. that there's a 90% probability that the true value is within the confidence interval)
    2. I could say that if there are a large number of different experiments run, and for each experiment a confidence interval is calculated, then a confidence level fraction of the confidence intervals would contain the true value. (e.g. for a 90% confidence level, 90% of the calculated confidence intervals would contain the true value).

    Is that right?
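
    Point 2 above can be checked by simulation. The following is only a sketch under assumptions that are not from this thread: a hypothetical population U(1,6), n = 10, and a simple normal-approximation interval (sample mean plus or minus z times s/sqrt(n)). It repeats the experiment many times and counts how often the computed interval contains the true mean.

```python
import random
from statistics import NormalDist, mean, stdev

def normal_interval(xs, level=0.90):
    """Normal-approximation interval for the mean: m +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided quantile
    half_width = z * stdev(xs) / len(xs) ** 0.5
    m = mean(xs)
    return m - half_width, m + half_width

def coverage(a=1, b=6, n=10, level=0.90, trials=5000, seed=1):
    """Fraction of repeated experiments whose interval contains the true mean.

    a, b, n and trials are illustrative assumptions.
    """
    rng = random.Random(seed)
    true_mean = (a + b) / 2
    hits = 0
    for _ in range(trials):
        xs = [rng.randint(a, b) for _ in range(n)]
        lo, hi = normal_interval(xs, level)
        if lo <= true_mean <= hi:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print("empirical coverage:", coverage())
```

    The printed fraction is a statement about the procedure over many experiments, not about any single interval. With n this small the normal approximation is rough, so the fraction need not match the nominal level exactly.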
  7. Jan 13, 2012 #6

    Stephen Tashi

    Science Advisor

    That is essentially correct. There are actually two versions of "confidence interval". The most respectable version (in my opinion) is an interval which has a length but not a definite center or definite endpoints. For example, one might say "the 90% confidence interval for the mean in this sampling plan is plus or minus 6.37". This relates to your situation number 2. When people observe a specific sample mean, such as 5.3, some will also call the numerical interval (5.3 - 6.37, 5.3 + 6.37) a "90% confidence interval". (My old college statistics text says that this is done "by abuse of language".) This relates to your situation number 1 and, as you say, it is not correct to claim that there is a 90% probability that the population mean is in a given interval with specific numerical endpoints.

    To say something about the probability of the population mean being somewhere, one must admit the idea that this probability can exist. (Confidence intervals for the mean assume the population mean is in a "fixed but unknown" location and this contradicts the idea that there is anything probabilistic about it.) To compute a "credible interval" one must assume a distribution for the population mean that applies before we observe the sample (a so-called "prior distribution"). There are several approaches to estimating prior distributions.

    The most comfortable situation is when you are dealing with a situation similar to ones that you have handled before. For example, if you were trying to find the mean concentration of gold in an ore sample, you might have assayed hundreds of other samples and you might use a histogram of those concentrations as the prior distribution.

    If you are dealing with a one-of-a-kind situation, you must be willing to imagine some probability model where that situation is one realization of many possibilities. For example, if you were measuring the mass of Jupiter, you could think of the process that gave it its mass as being probabilistic. Such a probability model may suggest a prior distribution. If not, then one may take the approach of "maximum entropy". The general idea of that is to assume a prior distribution which has maximum uncertainty, in the technical sense of having maximum entropy. (A famous advocate of this approach was Edwin Jaynes and his book "Probability Theory: The Logic of Science" is available on the web.)

    There are cases when no prior distribution exists. For example, if you think "the mean is equally likely to be any real number", you can't define a uniform distribution on the real numbers. However, one can sometimes do philosophically suspicious things like defining a uniform distribution on [-n,n], finding a credible interval that is a function of n and then taking the limiting value of the interval as n approaches infinity.

    Amusingly, in many problems, the Bayesian credible intervals turn out to be numerically the same interval as a confidence interval.
  8. Jan 17, 2012 #7

    Stephen Tashi

    Science Advisor

    There are probably texts or papers that work your problem in a Bayesian manner. Nevertheless it might be interesting to try solving it in a naive straightforward way without looking up such materials. I don't know if I can do this correctly, but I'll start.

    Let's assume the population values are integers. Let [itex] A = [/itex] the minimum possible value and [itex] D = [/itex] the number of possible values. The distribution of the population given [itex] A [/itex] and [itex] D [/itex] is uniform, so it assigns a probability of [itex] \frac{1}{D} [/itex] to each integer in the interval [itex] [A,A+D-1] [/itex].

    For a prior distribution of population parameters, assume [itex] A [/itex] is equally likely to be any integer between (or including) the given integers [itex] A_L [/itex] and [itex] A_U[/itex]. Assume [itex] D [/itex] is equally likely to be any integer between (and including) the given integers [itex] D_L [/itex] and [itex] D_U [/itex].

    So for integers [itex] (a,d) [/itex] in the permitted range, [itex] P(A=a,D=d) = \frac{1}{A_U - A_L + 1} \frac{1}{D_U - D_L + 1} [/itex]
    Denote this constant by [itex] P(A=a,D=d) = \lambda [/itex].

    Let [itex] X [/itex] be the random variable representing a vector of [itex] n [/itex] independent samples [itex] X_1, X_2,...X_n [/itex] from the population.
    Let [itex] X_{min} = \min\{X_1,X_2,...X_n\} [/itex]
    Let [itex] X_{max} = \max\{X_1,X_2,...X_n\} [/itex]

    [itex] P(X=x| A=a,D=d) = \frac{1}{d^n} [/itex] if [itex] x_{min} \ge a [/itex] and [itex] x_{max} \le a+d-1 [/itex]
    [itex] P(X=x|A=a,D=d) = 0 [/itex] otherwise

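    [itex] P(X=x,A=a,D=d) = \frac{1}{d^n} \lambda [/itex] if [itex] A_L \le a \le \min(x_{min}, A_U) [/itex] and [itex] \max(D_L,\, x_{max} - a + 1) \le d \le D_U [/itex] (the lower limit on [itex] d [/itex] comes from the requirement [itex] x_{max} \le a+d-1 [/itex], so it depends on [itex] a [/itex])
    [itex] P(X=x,A=a,D=d) = 0 [/itex] otherwise.

    [itex] P(X=x) = \sum_{a=A_L}^{\min(x_{min}, A_U)} \sum_{d=\max(D_L,\, x_{max} - a + 1)}^{D_U} \frac{1}{d^n} \lambda [/itex]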

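    The above summation is finite, so it shouldn't be hard to evaluate numerically (the sum over [itex] d [/itex] is a partial sum of the series [itex] \sum 1/d^n [/itex] rather than of a geometric series).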

    [itex] P(A=a,D=d| X=x) = \frac{ P(X=x,A=a,D=d)}{P(X=x)} [/itex]

    I think the above shows that we can numerically calculate the conditional distribution of the parameters [itex] A,D [/itex] given the sample.

    It remains to calculate the conditional distribution of the mean [itex] \mu [/itex] given [itex] X [/itex], which is a function of those parameters. We can try that if you're interested. Or maybe it's simpler to look in a book!
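
    A rough numerical sketch of where the derivation above leads, not a checked solution. It enumerates the permitted pairs (a, d), weights each by 1/d^n (the constant prior cancels on normalizing), and collects the weight on the mean mu = a + (d-1)/2 of a uniform distribution on [a, a+d-1]. For each a, the lower limit on d is taken as x_max - a + 1 so that the likelihood condition x_max <= a+d-1 holds. The function names, the prior bounds and the sample in the usage note are illustrative assumptions.

```python
from collections import defaultdict

def posterior_of_mean(xs, a_lo, a_hi, d_lo, d_hi):
    """Posterior P(mu | X = xs) under independent uniform priors on A and D.

    a_lo..a_hi and d_lo..d_hi are the assumed prior ranges (A_L, A_U, D_L, D_U).
    Returns a dict {mu: probability} with keys in increasing order.
    """
    n = len(xs)
    x_min, x_max = min(xs), max(xs)
    weights = defaultdict(float)
    # a cannot exceed the smallest observation.
    for a in range(a_lo, min(a_hi, x_min) + 1):
        # d must be large enough that a + d - 1 reaches the largest observation.
        for d in range(max(d_lo, x_max - a + 1), d_hi + 1):
            mu = a + (d - 1) / 2
            weights[mu] += d ** (-n)   # likelihood 1/d^n
    total = sum(weights.values())
    if total == 0:
        raise ValueError("the sample is impossible under the assumed prior ranges")
    return {mu: w / total for mu, w in sorted(weights.items())}

def credible_interval(posterior, level=0.95):
    """Equal-tailed credible interval read off the cumulative posterior."""
    tail = (1 - level) / 2
    cumulative = 0.0
    lo = hi = None
    for mu, p in posterior.items():
        cumulative += p
        if lo is None and cumulative >= tail:
            lo = mu
        if cumulative >= 1 - tail:
            hi = mu
            break
    if hi is None:          # guard against floating-point shortfall
        hi = mu
    return lo, hi
```

    For example, `credible_interval(posterior_of_mean([3, 5, 4, 6, 2, 5], -10, 10, 1, 30))` returns the endpoints for a made-up sample of six values. The result depends on the chosen prior ranges, which is the caveat about priors discussed earlier in the thread.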
  9. Jan 19, 2012 #8

    This discussion may be of interest. The basis for determining the confidence interval for a uniform distribution depends on the Central Limit Theorem. Whatever theoretical objections may be raised, confidence intervals are still widely used in scientific articles published in peer reviewed journals.
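
    For reference, the usual interval obtained from the Central Limit Theorem has the form below; this is a sketch of the standard textbook formula rather than something quoted from the linked discussion. Here [itex] \bar{x} [/itex] is the sample mean, [itex] s [/itex] the sample standard deviation and [itex] z_{1-\alpha/2} [/itex] the standard normal quantile.

```latex
\bar{x} \pm z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

    For small samples such as [itex] n \le 10 [/itex] the normal approximation is rough; a common adjustment is to use the Student t quantile with [itex] n-1 [/itex] degrees of freedom in place of [itex] z_{1-\alpha/2} [/itex]. If a and b were known, the population variance of the discrete uniform distribution would be [itex] \frac{(b-a+1)^2 - 1}{12} [/itex].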

  10. Jan 19, 2012 #9
    I've read both your posts and thank you for the assistance so far. I'm currently digesting it and trying to give it some more thought on my own, but I don't have very much time to spare at the moment. I will try to come back in a few days with comments and more questions.
  11. Feb 17, 2012 #10
    New reply since I cannot edit my last post:

    It's become painfully obvious to me that I do not currently have the time to pursue this ("a few days"... pfft >_>)

    Hopefully I will be able to pick this up again at a later point in time. Until then, thank you Stephen Tashi and SW VandeCarr for contributing.