Probability for the most frequent number in lottery?

  1. I was wondering, what is the estimated frequency for the most frequent number in lottery draws? Of course, I don't know which number it will be, but will the probability for that number converge to a certain estimate?

    What would be the equation for possible N numbers (e.g. N=49) for the probability P of the most frequent number?
    Can I even estimate the standard deviation on that estimate with an equation?

    Is it even possible to give a general form for the second most frequent number and so on (i.e. P(1), P(2),...)?
     
    Last edited: Oct 19, 2009
  2. jcsd
  3. mathman


    If it's a fair lottery all numbers would have the same probability, that is 1/N.
     
  4. In advance they have 1/N. But after 1000 draws there is a very high probability that one of the numbers will appear more often.

    For example, the same is true for the 1D random walk, where a drunk sailor steps either left or right at each step. After N steps the root-mean-square distance from the starting point is sqrt(N), so an imbalance is expected.

    I searched on the internet and this topic seems to be called "order statistics". I'm just not sure how to do the maths and whether correlations matter... :(

    Experimentally I find for drawing 6 out of 49 numbers (10000 times) about 12.33(1)% for the most likely number and 12.17(1)% for the least likely number.
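
    An experiment of this kind can be sketched as follows. This is only a minimal illustration, not the script used for the figures above: the function name, the seed and the single-run setup are assumptions, and the exact percentages depend on the seed and on how many runs are averaged.

```python
import random
from collections import Counter

def simulate_lottery(n=49, k=6, r=10000, seed=0):
    """Draw k distinct numbers from 1..n, r times; return relative frequencies."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    counts = Counter()
    numbers = range(1, n + 1)
    for _ in range(r):
        # one lottery draw: k numbers without replacement
        counts.update(rng.sample(numbers, k))
    # fraction of draws in which each number appeared
    return {i: counts[i] / r for i in numbers}

freqs = simulate_lottery()
print(f"most frequent:  {max(freqs.values()):.4%}")
print(f"least frequent: {min(freqs.values()):.4%}")
print(f"expected k/n:   {6 / 49:.4%}")
```

    Every number appears in a fraction of about k/n = 6/49 of the draws, and the most and least frequent numbers sit a little above and below that value.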
     
    Last edited: Oct 19, 2009
    This is an important example of how distinct, diverse patterns can arise out of a uniformly random process. If k small random samples of size n are isolated from a large, uniformly randomly generated set of size N such that N/n is large, then the distribution of the means of the k samples would have greater variance than if N/n were small. Each sample is then allowed to grow randomly, according to its distribution parameters, to a large size N', and the process is repeated. One gets increasingly different distributions as the process is repeated. This will occur without any non-random selection process. It can occur by isolation alone.
     
    Last edited: Oct 20, 2009
    Considering expectations may shed some light on the solution of this problem.

    The probability that a specified number will occur exactly j times in r drawings follows the binomial distribution:

    p(j,r)=b(j;r,1/n)

    (j is number of successes, r is number of drawings and 1/n is probability for success)

    Thus expected number of numbers that will occur exactly j times in r drawings is simply

    E=n*p(j,r)

    So take n=49 and say r=188

    Expected number of numbers that will not occur in 188 drawings is close to 1.
    Expected number of numbers that will occur exactly 3 times in 188 drawings is close to 10.

    Expected number of numbers that will occur exactly 8 times in 188 drawings is again close to 1.
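
    The three figures above can be checked with a few lines of Python. This is a sketch of the formula E=n*p(j,r) only; the function name is made up for the illustration.

```python
from math import comb

def expected_numbers_with_count(j, r, n):
    """E[# of numbers seen exactly j times in r drawings] = n * b(j; r, 1/n)."""
    p = 1 / n
    return n * comb(r, j) * p**j * (1 - p)**(r - j)

n, r = 49, 188
for j in (0, 3, 8):
    print(j, round(expected_numbers_with_count(j, r, n), 3))
```

    Summing E over all j from 0 to r gives n back, since every number has exactly one count.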
     
  7. I don't need to know the number of numbers occurring j times.

    I only want to find the number of occurrences of the most frequent number.

    Basically that's just the "order statistics" problem, but I don't know how to apply the equations, and I am also not sure whether the correlation between the counts of all numbers plays a role.
     
  8. In the given case, n=49 and r=188, the most frequent number will occur 8 times (on average).
    Do you want to know the probability of this happening?
     
  9. Your number seems correct experimentally, though I haven't quite understood where it came from. Also, I cannot imagine that one can dismiss order statistics. Or is your method equivalent in this case?

    I'd be interested in the best analytical expression (normal approximation) to estimate the frequency of the most frequent number.

    And how does it make a difference that I'm actually drawing 6 numbers from 49 in one go?
     
    Last edited: Oct 20, 2009
  10. Ok, I will try to analyze your input (drawing 6 out of 49 numbers, 10000 times) with my expectation approach. That means we set n=49 and r=6*10000=60000 in the expectation formula.

    Below is a piece of the formula outputs:

    j    E
    1215 0.5456
    1216 0.5495
    1217 0.5530
    1218 0.5560
    1219 0.5586
    1220 0.5607
    1221 0.5623
    1222 0.5635
    1223 0.5642
    1224 0.5644
    1225 0.5642
    1226 0.5635
    1227 0.5623
    1228 0.5607
    1229 0.5586
    1230 0.5561
    1231 0.5531

    From this we get that the maximum expectation, 0.5644, falls on j=1224, and this means that the most frequent number will occur 1224 times on average.
    But note that the differences from the neighbouring values are negligible, and in practice there is no reason to assume that one of the numbers will appear more often.
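
    For r this large, (1/n)^j underflows in ordinary floating point, so one way to evaluate n*b(j; r, 1/n) is in log space. The sketch below is only an illustration of that idea (the function name and the use of lgamma are assumptions, not necessarily how the table was produced).

```python
from math import exp, lgamma, log, log1p

def expected_count_log(j, r, n):
    """n * b(j; r, 1/n), evaluated in log space so that large r does not underflow."""
    p = 1 / n
    # log of the binomial coefficient via log-gamma, plus the log of the power terms
    log_pmf = (lgamma(r + 1) - lgamma(j + 1) - lgamma(r - j + 1)
               + j * log(p) + (r - j) * log1p(-p))
    return n * exp(log_pmf)

n = 49
for r, js in ((60000, range(1215, 1232)), (1000, range(20, 37))):
    print(f"r = {r}")
    for j in js:
        print(f"  {j:5d}  {expected_count_log(j, r, n):.4f}")
```

    The same function works unchanged for smaller r.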

    But let's try now with r=1000.

    Below is a piece of the formula outputs:

    j    E
    20 4.3787
    21 4.2571
    22 3.9467
    23 3.4962
    24 2.9651
    25 2.4116
    26 1.8841
    27 1.4160
    28 1.0251
    29 0.7158
    30 0.4827
    31 0.3146
    32 0.1985
    33 0.1213
    34 0.0719
    35 0.0413
    36 0.0231

    Now we see that a number with frequency 28 is likely to appear, because the expectation is close to 1.
    The same applies to a number with frequency 27.
    And with high probability two numbers will occur with frequency 26, because the expectation is close to 2,
    and so on.

    A conclusion:

    The higher the number of drawings, the lower the probability that one of the numbers will appear more often.

    I think, very often it is easier to analyze things via expectations rather than via complicated probability distributions.
     
  11. For a complete ordered sequence of n integers from a to b under a uniform distribution, and some integer k such that:

    [tex]a\leq k\leq b[/tex], the probability mass function is 1/n.

    The mean is (a+b)/2 and the variance is [tex] \frac{(b-a+1)^2-1}{12}[/tex].

    From this you can calculate a standard deviation (SD). However I don't think the SD is really defined for the uniform distribution so I don't believe your question can be answered analytically. The SD is based on the normal distribution.
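
    The mean and variance of the uniform distribution on the integers a..b can be checked numerically. The sketch below uses the standard closed forms (a+b)/2 and ((b-a+1)^2-1)/12 and compares them with a brute-force calculation; the function name and the choice a=1, b=49 are only for illustration.

```python
from statistics import mean, pvariance

def discrete_uniform_moments(a, b):
    """Closed-form mean and variance of the uniform distribution on a..b."""
    n = b - a + 1                      # number of equally likely values
    return (a + b) / 2, (n * n - 1) / 12

a, b = 1, 49
values = range(a, b + 1)
print(discrete_uniform_moments(a, b))   # → (25.0, 200.0)
print(mean(values), pvariance(values))  # brute-force population mean and variance
```

    The square root of the variance gives a finite standard deviation for the numbers themselves, about 14.1 for 1..49.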

    EDIT: The probability that a given number k from n=49 is drawn at least once in r trials is 1-((n-1)/n)^r. The probability of k being drawn q times in r trials is (1-((n-1)/n)^r)^q. To my knowledge, there's no way to predict a maximal value of q in any given experiment.

    I think the proper question is the one I alluded to in post 4. Given r random samples of size n (n>2) from a uniform distribution of size N (N>n), what is the probability of a sample mean equal to or exceeding some value k; [tex] a\leq k\leq b[/tex] as r grows large. This can be obtained from normal theory based on the Central Limit Theorem. It's understood that with the normal distribution, certain sample means will be more probable than others.
     
    Last edited: Oct 23, 2009
  12. Sounds like you're interested in showing whether or not some observed frequencies are statistically significant.

    To formulate the problem more precisely, the lotto consists of k samples without replacement from a population of size n, repeated r times. Let the total counts of each lotto number be [tex](N_1,...,N_n)[/tex]. (I reversed the capitalizations to make it more obvious what are the random variables.) Let the observed frequency of each lotto number be [tex]X_i=N_i/r[/tex]. The fundamental questions are:
    1. What is the joint distribution of [tex](X_1,...,X_n)[/tex]?
    2. What is the distribution of [tex]\max(X_1,...,X_n)[/tex]?
    3. What is the joint distribution of the order statistics of [tex](X_1,...,X_n)[/tex]?
    4. What are the asymptotics of these?
    This would be difficult if not intractable except for a few small cases (e.g. using multivariate generating polynomials).

    With Eero's insight that the marginal distribution of each N is binomial, and adapting SW VandeCarr's CLT idea, each X would be [tex]k/n\pm O(1/\sqrt{r})[/tex]. This tells us that whatever their dependence structure they are all clustered around [tex]k/n[/tex], which agrees with your simulation.
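
    A rough normal-approximation estimate of the largest frequency can be sketched from this. Two assumptions go beyond what is stated above: the n frequencies are treated as independent normal variables (in reality they are slightly negatively correlated, since the counts sum to k*r), and the expected maximum of n normals is approximated by the quantile formula usually attributed to Blom, with alpha=3/8. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_max_frequency(n=49, k=6, r=10000, alpha=0.375):
    """Heuristic estimate of the expected largest relative frequency."""
    p = k / n                           # each N_i is Binomial(r, k/n)
    sd = sqrt(p * (1 - p) / r)          # standard deviation of X_i = N_i / r
    # approximate expected maximum of n independent standard normals
    z = NormalDist().inv_cdf((n - alpha) / (n - 2 * alpha + 1))
    return p, sd, p + z * sd

p, sd, x_max = approx_max_frequency()
print(f"k/n = {p:.4%}, sd = {sd:.4%}, estimated max = {x_max:.4%}")
```

    By symmetry the same estimate with a minus sign applies to the least frequent number, and the gap between the two shrinks like 1/sqrt(r).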

    Eero's next step defines [tex]Y_j=\#\{i:N_i=j\}[/tex] which can be written as a sum of indicator functions so his formula for [tex]E\left[Y_j\right][/tex] holds by linearity - but I don't yet understand how the distribution of the maximum frequency can be inferred this way. Wouldn't the distribution of the maximum vary with the dependence structure?
     