I was wondering, what is the expected frequency of the most frequent number in lottery draws? Of course, I don't know which number it will be, but will the probability for that number converge to a certain estimate? What would the equation be for the probability P of the most frequent number, given N possible numbers (e.g. N=49)? Can I even estimate the standard deviation of that estimate with an equation? Is it even possible to give a general form for the second most frequent number and so on (i.e. P(1), P(2),...)?
In advance, each number has probability 1/N. But after 1000 draws there is a very high probability that one of the numbers will have appeared more often than the others. The same is true for the 1D random walk, where a drunk sailor steps either left or right at each step: after N steps the typical distance from the starting point is on the order of sqrt(N), so an imbalance is expected. I searched on the internet and this topic seems to be called "order statistics". I'm just not sure how to do the maths and whether correlations matter... :( Experimentally, drawing 6 out of 49 numbers (10000 times), I find about 12.33(1)% for the most frequent number and 12.17(1)% for the least frequent number.
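For anyone who wants to repeat the experiment described above, here is a minimal simulation sketch. The function name `simulate_frequencies`, the seed and the default parameters are my own illustrative choices, not the original poster's code, and the printed figures depend on the seed and on the number of drawings, so they need not match the percentages quoted in the post.

```python
import random
from collections import Counter


def simulate_frequencies(n=49, k=6, draws=10_000, seed=0):
    """Return {number: fraction of drawings containing it} for numbers 1..n.

    Each drawing picks k distinct numbers out of n without replacement.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    numbers = range(1, n + 1)
    counts = Counter()
    for _ in range(draws):
        counts.update(rng.sample(numbers, k))
    return {i: counts[i] / draws for i in numbers}


if __name__ == "__main__":
    freq = simulate_frequencies()
    print(f"expected per-number frequency k/n: {6 / 49:.4%}")
    print(f"most frequent number:  {max(freq.values()):.4%}")
    print(f"least frequent number: {min(freq.values()):.4%}")
```

Since every drawing contributes exactly k numbers, the frequencies always add up to k, which is one place where the correlation between the counts shows up.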
This is an important example of how distinct, diverse patterns can arise out of a uniformly random process. If k small random samples of size n are isolated from a large, uniformly randomly generated set of size N such that N/n is large, then the distribution of the means of the k samples will have greater variance than if N/n were small. Each sample is then allowed to grow randomly according to its distribution parameters to a large size N', and the process is repeated. One gets increasingly different distributions as the process is repeated. This will occur without any non-random selection process; it can occur by isolation alone.
Considering expectations may shed some light on this problem. The probability that a specified number will occur exactly j times in r drawings follows the binomial distribution: p(j,r) = b(j; r, 1/n), where j is the number of successes, r is the number of drawings and 1/n is the probability of success. Thus the expected number of numbers that will occur exactly j times in r drawings is simply E = n*p(j,r). So take n=49 and, say, r=188. The expected number of numbers that will not occur at all in 188 drawings is close to 1. The expected number of numbers that will occur exactly 3 times in 188 drawings is close to 10. The expected number of numbers that will occur exactly 8 times in 188 drawings is again close to 1.
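The expectation formula E = n*b(j; r, 1/n) from this post is easy to evaluate directly. The sketch below does so for n=49 and r=188; the function name `expected_count` is only an illustrative choice, and the formula assumes one number per drawing with success probability 1/n, as in the post.

```python
from math import comb


def expected_count(j, r, n):
    """Expected number of the n numbers that occur exactly j times in r drawings.

    Uses E = n * b(j; r, 1/n), with b the binomial probability mass function.
    """
    p = 1 / n
    return n * comb(r, j) * p**j * (1 - p) ** (r - j)


if __name__ == "__main__":
    for j in (0, 3, 8):
        print(f"j={j}: E={expected_count(j, 188, 49):.3f}")
```

Summing E over all j from 0 to r gives n back, because every number occurs some number of times.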
I don't need to know the number of numbers occurring j times. I only want to find the number of occurrences of the most frequent number. Basically that's just the "order statistics" problem, but I don't know how to apply the equations and am also not sure whether correlation between the counts of all the numbers plays a role.
In the given case, n=49 and r=188, the most frequent number will occur 8 times (on average). Do you want to know the probability of this happening?
Your number seems correct experimentally, though I haven't quite understood where it came from. Also, I cannot imagine that one can dismiss order statistics, or is your method equivalent in this case? I'd be interested in the best analytical expression (normal approximation) to estimate the frequency of the most frequent number. And what difference does it make that I'm actually drawing 6 numbers from 49 in one go?
OK, I will try to analyze your input (drawing 6 out of 49 numbers, 10000 times) with my expectation approach. That means we set n=49 and r=6*10000=60000 in the expectation formula. Below is a piece of the formula output:

j       E
1215    0.5456
1216    0.5495
1217    0.5530
1218    0.5560
1219    0.5586
1220    0.5607
1221    0.5623
1222    0.5635
1223    0.5642
1224    0.5644
1225    0.5642
1226    0.5635
1227    0.5623
1228    0.5607
1229    0.5586
1230    0.5561
1231    0.5531

From this we get that the maximum expectation, 0.5644, falls on j=1224, and this means that the most frequent number will occur 1224 times on average. But note that the differences from the neighbouring values are negligible, and in practice there is no reason to assume that one of the numbers will appear more often.

But let's try now with r=1000. Below is a piece of the formula output:

j       E
20      4.3787
21      4.2571
22      3.9467
23      3.4962
24      2.9651
25      2.4116
26      1.8841
27      1.4160
28      1.0251
29      0.7158
30      0.4827
31      0.3146
32      0.1985
33      0.1213
34      0.0719
35      0.0413
36      0.0231

Now we expect about one number with frequency 28, because the expectation is close to 1. The same applies to frequency 27. And we expect about two numbers with frequency 26, because the expectation is close to 2, and so on.

A conclusion: the higher the number of drawings, the lower the probability that one of the numbers will appear more often. I think it is very often easier to analyze things via expectations than via complicated probability distributions.
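The two tables can be regenerated with a short script. This is a sketch under the same assumptions as the post (E = n*b(j; r, 1/n) with n=49); the names `expected_count` and `print_table` are illustrative. For r=60000 the binomial coefficient is far too large to convert to a float, so the sketch works with logarithms via `math.lgamma` rather than computing the coefficient directly.

```python
from math import exp, lgamma, log


def expected_count(j, r, n):
    """E = n * b(j; r, 1/n), evaluated in log space to avoid overflow."""
    p = 1 / n
    log_pmf = (
        lgamma(r + 1) - lgamma(j + 1) - lgamma(r - j + 1)  # log C(r, j)
        + j * log(p)
        + (r - j) * log(1 - p)
    )
    return n * exp(log_pmf)


def print_table(r, n, js):
    """Print j and E for each j in js, one row per line."""
    print(f"r = {r}")
    for j in js:
        print(f"{j:6d}  {expected_count(j, r, n):.4f}")


if __name__ == "__main__":
    print_table(60_000, 49, range(1215, 1232))
    print_table(1_000, 49, range(20, 37))
```

Note that this treats the 6 numbers of one drawing as 6 separate single-number drawings, exactly as the post does by setting r=60000; it does not model drawing without replacement within a drawing.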
For an ordered complete sequence of n integers from a to b under a uniform distribution, and some integer k such that [tex]a\leq k\leq b[/tex], the probability mass function is 1/n. The mean is [tex]\frac{a+b}{2}[/tex] and the variance is [tex]\frac{(b-a+1)^2-1}{12}[/tex]. From this you can calculate a standard deviation (SD). However, I don't think the SD is really defined for the uniform distribution, so I don't believe your question can be answered analytically. The SD is based on the normal distribution. EDIT: The probability that you will draw a given number k from n=49 at least once in r trials is 1-((n-1)/n)^r. The probability of k being drawn exactly q times in r trials is [tex]\binom{r}{q}\left(\frac{1}{n}\right)^q\left(\frac{n-1}{n}\right)^{r-q}[/tex]. There's no way to predict a maximal value of q in any given experiment, to my knowledge. I think the proper question is the one I alluded to in post 4. Given r random samples of size n (n>2) from a uniform distribution of size N (N>n), what is the probability of a sample mean equal to or exceeding some value k, [tex]a\leq k\leq b[/tex], as r grows large? This can be obtained from normal theory based on the Central Limit Theorem. It's understood that with the normal distribution, certain sample means will be more probable than others.
Sounds like you're interested in showing whether or not some observed frequencies are statistically significant. To formulate the problem more precisely, the lotto consists of k samples without replacement from a population of size n, repeated r times. Let the total counts of each lotto number be [tex](N_1,...,N_n)[/tex]. (I reversed the capitalizations to make it more obvious what are the random variables.) Let the observed frequency of each lotto number be [tex]X_i=N_i/r[/tex]. The fundamental questions are: What is the joint distribution of [tex](X_1,...,X_n)[/tex]? What is the distribution of [tex]\max(X_1,...,X_n)[/tex]? What is the joint distribution of the order statistics of [tex](X_1,...,X_n)[/tex]? What are the asymptotics of these? This would be difficult if not intractable except for a few small cases (e.g. using multivariate generating polynomials). With Eero's insight that the marginal distribution of each N is binomial, and adapting SW VandeCarr's CLT idea, each X would be [tex]k/n\pm O(1/\sqrt{r})[/tex]. This tells us that whatever their dependence structure they are all clustered around [tex]k/n[/tex], which agrees with your simulation. Eero's next step defines [tex]Y_j=\#\{i:N_i=j\}[/tex] which can be written as a sum of indicator functions so his formula for [tex]E\left[Y_j\right][/tex] holds by linearity - but I don't yet understand how the distribution of the maximum frequency can be inferred this way. Wouldn't the distribution of the maximum vary with the dependence structure?
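As a rough numerical reference point for the maximum, one can pretend that the n counts are independent, each with the binomial marginal Binomial(r, k/n). Then P(max <= m) = F(m)^n, where F is the binomial CDF, and the expected maximum is the sum over m of 1 - F(m)^n. The sketch below does exactly that. Independence is an assumption made only for this illustration: the real counts are negatively correlated, because every drawing contributes exactly k numbers, so this does not settle the dependence question raised above. The function names are hypothetical.

```python
from math import exp, lgamma, log


def binomial_pmf(j, r, p):
    """Binomial probability of exactly j successes in r trials (log space)."""
    return exp(
        lgamma(r + 1) - lgamma(j + 1) - lgamma(r - j + 1)
        + j * log(p)
        + (r - j) * log(1 - p)
    )


def expected_max_count(n=49, k=6, r=10_000):
    """Expected largest count among n numbers, ASSUMING independent counts.

    Each count is treated as Binomial(r, k/n); E[max] = sum_m P(max > m).
    """
    p = k / n
    cdf = 0.0
    expected = 0.0
    for m in range(r):
        cdf = min(1.0, cdf + binomial_pmf(m, r, p))  # F(m), clamped at 1
        expected += 1.0 - cdf**n  # P(max > m) under independence
    return expected


if __name__ == "__main__":
    r, n, k = 10_000, 49, 6
    mean = r * k / n
    sd = (r * (k / n) * (1 - k / n)) ** 0.5
    e_max = expected_max_count(n, k, r)
    print(f"mean count {mean:.1f}, sd {sd:.1f}")
    print(f"expected max count (independence assumed): {e_max:.1f}")
    print(f"as a frequency: {e_max / r:.4%}")
```

Comparing the printed value with a direct simulation gives a feel for how much the dependence structure actually shifts the maximum.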