Mean time between lottery wins and probability of fraud by organizers

Jonathan212

Let's say the number 1 is picked 5% of the time, 2 is picked 4% of the time, etc to 50. That's 50 unknowns x1, x2, ..., x50. How do we get 50 equations to solve for these unknowns?

PeroK

Homework Helper
Gold Member
2018 Award
I'm not sure what you are learning from this. The question isn't directly how popular each individual number is but how popular different six-number combinations are. I've tended to use "numbers" above as shorthand for "combination of six numbers".

Jonathan212

The dependence between numbers played in a ticket must be very weak. I've got some data for the frequencies of individual winning numbers and am about to post a histogram. Unfortunately it's not the numbers played, only the winning numbers, which indirectly tell us what people tend to pick.

PeroK

It's too little data. It's only a few hundred winning combinations as a sample of 15 million possibilities.

Jonathan212

Oops. Looks like this lottery doesn't work the way we thought. You play 5 numbers from 1 to 45 and 1 number from 1 to 20. The chance of a single ticket winning is then 1 / (45 choose 5) * 1 / 20 = 1 / 24,435,180.
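
As a quick check of that figure, here is a minimal Python sketch. It assumes the rules exactly as described (5 of 45 plus 1 of 20, order irrelevant); the variable names are illustrative.

```python
import math

# Ways to choose 5 main numbers out of 45, order ignored.
main_combinations = math.comb(45, 5)

# Each main combination pairs with one of 20 possible extra numbers.
total_combinations = main_combinations * 20

print(main_combinations)   # 1221759
print(total_combinations)  # 24435180
```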

Jonathan212

And here are the histograms of winning frequencies for the first 5 numbers and for the 6th number. Not quite as sloped as expected!

Jonathan212

So we had 10,457,692,468 tickets sold (that is a lot!) and 465 winners. At 1 in 24,435,180 we would expect 428 winners. What is the statistical significance of 465 winners when 428 winners are expected in 1707 draws? I want a figure like those "p<0.0021" expressions in drug research.
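
For reference, the expected count quoted here is just tickets sold divided by the odds. A minimal Python sketch, assuming every ticket is an independent random pick (variable names are illustrative):

```python
tickets_sold = 10_457_692_468   # total tickets over all draws, from the post
odds = 24_435_180               # one winning combination in this many
observed_winners = 465

# Expected number of winning tickets if every ticket wins with probability 1/odds.
expected_winners = tickets_sold / odds
excess = observed_winners - expected_winners

print(f"expected winners: {expected_winners:.2f}")
print(f"excess over expectation: {excess:.2f}")
```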

PeroK

I suppose it depends a lot on to what extent you can trust the data! The numbers do look high, obviously. Are there any other restrictions that we don't know about?

The actual calculation is difficult because of the variations from week to week. You can get an estimate by looking at the probability of getting up to 464 winners in 1707 trials with a probability of 0.25 per trial. This turns out to be 98%.

So, only a 2% chance of 465 or more winners.

Of course, the data could have been at the other extreme as well.

But, it's clear that 465 or more winners is more likely when you can get multiple winners. As I said, the exact calculation would be very complicated.

My guess is you're somewhere in the range of a probability of 4-5% (including the other extreme).
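
The rough estimate above (1707 trials, probability 0.25 per trial) can be reproduced with a short Python sketch. It treats each draw as a single yes/no trial, which is the simplification described in the post, not the exact model; the helper name `binom_cdf` is illustrative.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total

n_draws = 1707
p_win = 0.25   # roughly 428 expected winners / 1707 draws

prob_up_to_464 = binom_cdf(464, n_draws, p_win)
prob_465_or_more = 1.0 - prob_up_to_464

print(f"P(X <= 464) = {prob_up_to_464:.3f}")
print(f"P(X >= 465) = {prob_465_or_more:.3f}")
```

The result should land close to the 98% and 2% figures quoted above.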

mfb

Mentor
"Here's the winning numbers at that draw, played in 8 different tickets.

34 27 13 17 6 13

Surprise, it can't be birthday numbers. It's as if someone knew what would happen and bought the same combination 8 times to ensure he wouldn't have to share too much of the prize."
If someone knew all the numbers in advance then it wouldn't make sense to buy multiple tickets for the same drawing. Too suspicious if the winners have some connection, and with just 5 million tickets you are likely to be the only winner anyway.
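
To put a number on "likely to be the only winner", here is a small Python sketch. It assumes 5 million other tickets in the draw, each an independent random pick, and the 1 in 24,435,180 odds worked out earlier in the thread; both inputs are assumptions for illustration.

```python
import math

other_tickets = 5_000_000   # assumed number of other tickets in the draw
odds = 24_435_180           # combinations, from earlier in the thread

# Probability that none of the other tickets hits the winning combination.
p_sole_winner = (1 - 1 / odds) ** other_tickets

# The same quantity via the Poisson approximation exp(-expected other winners).
p_sole_winner_approx = math.exp(-other_tickets / odds)

print(f"P(no other winner) = {p_sole_winner:.3f}")
print(f"Poisson approximation = {p_sole_winner_approx:.3f}")
```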

We still don't have the actual time series of drawings.

@PeroK: For the variance of the expected number of winners the week-by-week data is a higher order correction (taking into account the correlation between the tickets).

If everyone picks numbers randomly and we expect 428 winners then the standard deviation is sqrt(428)=20.7 and 465 is 1.8 standard deviations away (p=0.073). Take into account that people favor some numbers and it gets even more likely. No evidence of manipulation from the total number of winners.
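
The arithmetic in that last paragraph can be sketched in Python as follows. It assumes a Poisson count approximated by a normal distribution and a two-sided tail, which is what reproduces a value near p=0.073; names are illustrative.

```python
import math

expected = 428
observed = 465

sd = math.sqrt(expected)          # Poisson: variance equals the mean
z = (observed - expected) / sd    # deviation in units of standard deviations

# Two-sided tail of a standard normal: P(|Z| > z) = erfc(z / sqrt(2))
p_two_sided = math.erfc(z / math.sqrt(2))

print(f"sd = {sd:.1f}, z = {z:.2f}, two-sided p = {p_two_sided:.3f}")
```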

Jonathan212

If you compare the 464 winners with the 1707 trials it's hopeless, but if you compare them with the 10,457,692,468 tickets it's easy.

mfb said: "Take into account that people favor some numbers"
That's exactly what those histograms disprove; whatever effect there is, it is very weak. I could give it a value if you want: the sum of frequencies for numbers 1 to 30 is 66.48%, while it should be 30 / 45 = 66.67%. They are a tiny bit LESS popular than higher numbers!

Why is the standard deviation sqrt(428)?

mfb

The histograms show no preference for specific numbers but they don’t show preferences for specific combinations.

The variance of a Poisson distribution is the same as its mean, the standard deviation is the square root of the variance.
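
One way to see why the variance can be taken equal to the mean here: the winner count is binomial with a huge N and a tiny p, and its variance N p (1 - p) is then almost exactly N p. A minimal numeric check in Python, using the ticket count and odds from this thread (variable names are illustrative):

```python
n_tickets = 10_457_692_468
p_win = 1 / 24_435_180

mean = n_tickets * p_win
binomial_variance = n_tickets * p_win * (1 - p_win)

# For tiny p the factor (1 - p) is essentially 1, so variance ~ mean (the Poisson limit).
ratio = binomial_variance / mean

print(f"mean = {mean:.4f}")
print(f"binomial variance = {binomial_variance:.4f}")
print(f"variance / mean = {ratio:.10f}")
```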

Jonathan212

I can't reproduce that 465 winners number. How did you calculate it?

mfb

I just summed the entries in the second column in the long table. I get the same result if I multiply the second and third row in the first table and then sum the products.

Here is an xls file, which is more convenient than the text file.

Jonathan212

mfb said: "The variance of a Poisson distribution is the same as its mean"
Isn't Poisson distribution the distribution of the time between wins? I thought it's a binomial distribution we've got here instead, approximated as gaussian.

mfb

Jonathan212 said: "Isn't Poisson distribution the distribution of the time between wins?"
No.
Jonathan212 said: "I thought it's a binomial distribution we've got here instead, approximated as gaussian."
That is true as well. A Poisson distribution with a large expectation value is approximately a Gaussian distribution.

Jonathan212

Greetings. I'm intending to write this up for a non-expert high-school-level audience. Complete with links for explanations like the origin of "(45 choose 5)", why we look at a normal distribution, etc. But there is one point I haven't yet understood myself. Is it ok to NOT mention Poisson distribution at all and instead say that the number of tickets winning in the 16 years should follow a binomial distribution, which we approximate with a normal distribution like we did in my other question below?

mfb

Jonathan212 said: "Is it ok to NOT mention Poisson distribution at all and instead say that the number of tickets winning in the 16 years should follow a binomial distribution, which we approximate with a normal distribution like we did in my other question below?"
Sure. In that case you need the additional information that the variance of the normal distribution is equal to the mean.

Jonathan212

Can't I just ignore that information and instead give the fact that the binomial distribution in

= 1 - BINOMDIST( M - 1 , N , 0.5, 1 )

is approximated by the normal distribution in

= 1 - NORMDIST( M - 1, N * 0.5, SQRT( N * 0.5 * (1-0.5) ), 1 )

where we'd replace 0.5 by 1/24,435,180 and use N = 10,457,692,468 and M = 465 ?

Then the statistical significance of the M = 465 wins (ie the probability of 465 wins or more) is

p = 1 - NORMDIST( 465 - 1, 427.9768951, SQRT( 427.9768776 ), 1 )

p = 0.040816379

That's not the same as your p=0.073 result in #59. Am I doing something wrong?

EDIT: just found the error. You're looking at the "|z| >" value but you should be looking at "z >". And because we want 465 or more, ie > 464, you should have calculated how many standard deviations 464 is from 428, not 465 from 428. That's 1.74129038 standard deviations and we get the same result at z > 1.74129.
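
For anyone without a spreadsheet, the NORMDIST calculation above can be mirrored in Python. This is a sketch of the same normal approximation (one-sided, cutoff at M - 1 = 464); the helper `normal_cdf` is an illustrative stand-in for NORMDIST with the cumulative flag set.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative normal distribution, the analogue of NORMDIST(x, mean, sd, 1)."""
    return 0.5 * math.erfc(-(x - mean) / (sd * math.sqrt(2)))

n = 10_457_692_468     # tickets sold
p = 1 / 24_435_180     # chance of a single ticket winning
m = 465                # observed winners

mean = n * p
sd = math.sqrt(n * p * (1 - p))

z = (m - 1 - mean) / sd                        # standard deviations from the mean
p_one_sided = 1 - normal_cdf(m - 1, mean, sd)  # P(465 or more), normal approximation

print(f"z = {z:.5f}, one-sided p = {p_one_sided:.4f}")
```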

Jonathan212

In drug research the results are stated like this: p<0.01. How can we do the same in this problem? Ie how can we establish an upper bound for p given that the normal we're looking at is only an approximation to the binomial?

Jonathan212

Is there any site where you can calculate extreme binomial integrals like this one without the normal approximation?

= 1 - BINOMDIST( 465 - 1, 10457692468, 1/24435180, 1 )

mfb

Jonathan212 said: "Can't I just ignore that information and instead give the fact that the binomial distribution in

= 1 - BINOMDIST( M - 1 , N , 0.5, 1 )

is approximated by the normal distribution in

= 1 - NORMDIST( M - 1, N * 0.5, SQRT( N * 0.5 * (1-0.5) ), 1 )

where we'd replace 0.5 by 1/24,435,180 and use N = 10,457,692,468 and M = 465 ?"
There it is (bold added by me).
Jonathan212 said: "EDIT: just found the error. You're looking at the "|z| >" value but you should be looking at "z >"."
Why? Wouldn't a deviation in the other direction be equally suspicious?
Jonathan212 said: "And because we want 465 or more, ie > 464, you should have calculated how many standard deviations 464 is from 428, not 465 from 428."
Within the approximation from the Poisson distribution or normal distribution this doesn't matter. 464.5 should be slightly better.

WolframAlpha can calculate some extreme values. Check individual parts - you'll see the approximation is a *really* good one here.
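
As an alternative to WolframAlpha, the exact binomial tail can also be summed directly. The following Python sketch builds the probabilities up from P(X = 0) with the usual recurrence and compares the result with the normal approximation at the 464.5 cutoff mentioned above. The function name is illustrative, and the approach only works because (1 - p)^N, roughly e^-428, is still representable as a double; for much larger means it would underflow.

```python
import math

def binom_upper_tail(m, n, p):
    """P(X >= m) for X ~ Binomial(n, p), via the pmf recurrence starting at k = 0."""
    pmf = math.exp(n * math.log1p(-p))   # P(X = 0) = (1 - p)^n
    cdf = 0.0
    odds_ratio = p / (1 - p)
    for k in range(m):
        cdf += pmf                             # add P(X = k)
        pmf *= (n - k) / (k + 1) * odds_ratio  # step to P(X = k + 1)
    return 1.0 - cdf

n = 10_457_692_468
p = 1 / 24_435_180

exact_tail = binom_upper_tail(465, n, p)

# Normal approximation with the half-integer cutoff 464.5.
mean = n * p
sd = math.sqrt(n * p * (1 - p))
normal_tail = 0.5 * math.erfc((464.5 - mean) / (sd * math.sqrt(2)))

print(f"exact binomial P(X >= 465) = {exact_tail:.4f}")
print(f"normal approx (464.5)      = {normal_tail:.4f}")
```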

Jonathan212

Why did you add the bold? To say it is incorrect? This is the formula we derived in the other thread for an identical problem with different N, M and probability. EDIT: it matches WolframAlpha perfectly too, if you type it in Excel.

A deviation in the opposite direction, i.e. too few winning tickets, would not line the pockets of the organizers as easily because there are accountants auditing where the money goes when there is no win - it goes to the next draw.

Jonathan212

Another question is how many digits of this p = 0.040816379 result should we trust. Should the statistical significance be shown as "p < 0.05"?

mfb

Jonathan212 said: "Why did you add the bold? To say it is incorrect?"
It is not incorrect. Check how you started the post (it is in the quote). You asked "can I ignore that, and just use [...]", but this "[...]" included the information you asked about.
Jonathan212 said: "A deviation in the opposite direction, i.e. too few winning tickets, would not line the pockets of the organizers as easily because there are accountants auditing where the money goes when there is no win - it goes to the next draw."
A larger jackpot tends to attract more players, which means a larger profit for the organizers.
Jonathan212 said: "Another question is how many digits of this p = 0.040816379 result should we trust."
Certainly don't use more than two significant figures. p=0.041 looks good, p=0.04 is not bad either. It is not small enough to claim fraud, especially as we know there are factors that make us underestimate the p-value.

Jonathan212

Does a question like "what is the probability that the organizers have never cheated by adding a winner after a draw?" make sense mathematically?