I Mean time between lottery wins and probability of fraud by organizers

Jonathan212 · May 28, 2019

Looked at some lottery wins and something was fishy. This is a lottery where you pick 5 numbers out of the set (1, 2, ..., 50). When no one wins, the money goes to the next iteration of the game, so the prize gets bigger and bigger. It seemed that a win was too regular, around every 2 or 3 weeks, and never occurred in consecutive draws. As if people were waiting for the money to accumulate, which is probably true, but there is another possibility too, and that is fraud by the organizers: too few people play, no wins occur at all for ages, and because this demotivates players and could reduce sales to a possible collapse, the organizers cheat and win the prize themselves periodically. How would you investigate this mathematically based on the observed distribution of time between wins? What is the expected mean and standard deviation of the time between (full-match) wins? Given the number of 5-number sets played in each iteration: K = 100,000 and iterations per week = 2.

The observed time between wins seems to have a sharp distribution with a mean around 2.5 weeks and this is fishy. Based on an observation like this, what is the probability that the organizers cheat?

What should K be to match the observed mean time between wins?

Dale · May 28, 2019

Jonathan212 said:

What is the expected mean and standard deviation of the time between (full-match) wins?

I don’t think the answer can be determined without knowing the number of tickets sold. If K fluctuates then it would be complicated.

Jonathan212 · May 28, 2019

Sure, just pretending we know K and it is fixed, just to derive the formula. Then stick in some real numbers. Or start from the 2.5 weeks and derive a fixed K, then work out standard deviation around the 2.5 weeks and compare this with observed deviation.

mfb · May 28, 2019

There are (50 choose 5) options to pick lottery numbers, so on average you expect one jackpot winner every (50 choose 5)/K draws (shared jackpots count as multiple winners). Plug in numbers and see if they look realistic. The assumption of a constant number of players is very unrealistic, however. Larger jackpots attract more players.
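As a rough check of this estimate, here is a minimal sketch using the figures from the opening post (K = 100,000 tickets per draw, 2 draws per week). It assumes K is constant and treats (50 choose 5)/K as the average number of draws per jackpot winner; the variable names are illustrative only.

```python
from math import comb

COMBINATIONS = comb(50, 5)      # ways to pick 5 numbers out of 50
K = 100_000                     # tickets per draw (figure from the opening post)
DRAWS_PER_WEEK = 2              # also from the opening post

winners_per_draw = K / COMBINATIONS           # expected jackpot winners per draw
draws_per_winner = COMBINATIONS / K           # average number of draws per winner
weeks_per_winner = draws_per_winner / DRAWS_PER_WEEK

# K that would give one winner every 2.5 weeks (= 5 draws) on average
k_for_observed = COMBINATIONS / (2.5 * DRAWS_PER_WEEK)

print(COMBINATIONS)                 # 2118760
print(round(weeks_per_winner, 1))   # 10.6
print(round(k_for_observed))        # 423752
```

With these assumptions, K = 100,000 gives a winner roughly every 10 to 11 weeks, and a mean of 2.5 weeks would need a bit over 400,000 tickets per draw.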

Sometimes so "many players" that they change the lottery (long article, skip to the first mention of "Winfall" for the lottery part).

Jonathan212 · May 28, 2019

What's (50 choose 5) short for?

mfb · May 28, 2019

The number of ways to pick 5 elements out of 50 where the order doesn't matter. Wikipedia has the formulas.

Dale · May 28, 2019

You can use the binomial distribution to calculate the number of likely winners each week. For that you need the number of players (K) and the probability of each player winning (1/[50 choose 5]). Then the probability of at least 1 person winning is 1 - the probability of 0 people winning. That latter probability is exactly what the binomial distribution gives.

That gives you the probability of a winner each draw, call that P. The probability that the first win comes in the Nth draw is then ##(1-P)^{N-1}P##.

However, if you detect a significant discrepancy from this model then it is not an indication of fraud. This model assumes that K is constant and that the players' number selections are independent. A failure of either assumption is not an indication of fraud.
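The waiting-time model described here is a geometric distribution, whose mean is 1/P and whose standard deviation is sqrt(1-P)/P. A minimal sketch, assuming constant K and independent, uniformly random tickets (the function name `wait_stats` is made up for illustration):

```python
from math import comb, sqrt

def wait_stats(tickets, combinations):
    """Return P(win in a draw) and the mean and SD, in draws, of the wait for a win."""
    # P(at least one winner) = 1 - P(no ticket matches)
    p_win = 1 - (1 - 1 / combinations) ** tickets
    mean = 1 / p_win                  # mean of the geometric distribution
    sd = sqrt(1 - p_win) / p_win      # standard deviation of the geometric distribution
    return p_win, mean, sd

# Figures from the opening post: K = 100,000, 5 out of 50, 2 draws per week
p_win, mean_draws, sd_draws = wait_stats(100_000, comb(50, 5))
print(p_win)
print(mean_draws / 2, sd_draws / 2)   # converted to weeks
```

Under this model the standard deviation is almost as large as the mean, so the times between wins would be broadly spread rather than sharply peaked.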

Jonathan212 · May 28, 2019

How many values of K and results do you need to calculate probability of fraud with decent accuracy?

mfb · May 28, 2019

Do you know how many people play each time?

Your question is too broad to answer.

Dale · May 28, 2019

Jonathan212 said:

How many values of K and results do you need to calculate probability of fraud with decent accuracy?

I don’t think that is possible. What you can calculate is the probability of the observed periods between wins according to the model that K is constant and all lottery picks are random. That will undoubtedly be some very low probability. But as we mentioned above there are many ways that model could be wrong besides fraud.

Jonathan212 · May 29, 2019

K is not constant we said. It is given for each week. 100,123 the first week, 192,321 the second week, 255,233 the third week etc. Just 3 weeks is way too small a sample. But a year's worth, maybe. Presented with 50 values of K and 50 values of results (=number of winners this week), what is the probability that this happened by chance? If all K's were of the order of 100-1000, the probability that a win occurs every week is close to 0. A win once a year is probably close to 0 too. But with higher K's we need a mathematician.

Reminds me of drug testing against placebos. Someone must have heard of "statistical significance". It's figures like "< 0.001". Looks like a probability.

mfb · May 29, 2019

Jonathan212 said:

Presented with 50 values of K and 50 values of results (=number of winners this week), what is the probability that this happened by chance?

No matter what K is it will be tiny because it is not the right question.
An equivalent question: Given a sequence of 20 coin tosses (HHTHTHTTHHHTTHTHHTHH), what is the probability that this happened? 1/2²⁰ or about 1 in a million. Should we be surprised by this particular result? No. All ~1 million possible sequences have this probability and the one I selected is nothing special.

What you need is the probability "this result or more extreme" where "more extreme" is to be defined. In the coin toss example you could ask "I got 15 times heads, how likely is it that I get 15 or more times the same result?" For the lottery you could consider the total number of wins: How many do you expect given the number of players, how many times did someone win? How likely is it to get so few wins or even fewer? So many or even more?
If you want to look for a pattern of "there is a winner if the jackpot is high" then it gets more complicated to define what we are looking for. This has to be done before analyzing the actual results, otherwise you might bias yourself by selecting a question specifically to find something unusual.
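The coin-toss version of "this result or more extreme" can be computed directly. A minimal sketch, assuming a fair coin and counting both "15 or more heads" and "15 or more tails" as extreme (the function name is illustrative):

```python
from math import comb

def prob_at_least_same(n_tosses, k):
    """P(at least k heads or at least k tails) in n fair tosses, valid for k > n/2."""
    one_side = sum(comb(n_tosses, i) for i in range(k, n_tosses + 1)) / 2 ** n_tosses
    # The two events cannot both happen when k > n/2, so their probabilities add
    return 2 * one_side

p = prob_at_least_same(20, 15)
print(round(p, 4))   # 0.0414
```

So a result at least as lopsided as 15 out of 20 turns up in about 4% of fair 20-toss experiments, very different from the 1 in a million chance of any one specific sequence.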

Dale · May 29, 2019

Jonathan212 said:

Looks like a probability.

It is a probability. It is the probability that data this “extreme” occurred by chance given your data generating model. This is not the same as the probability that fraud was committed.

Jonathan212 · May 29, 2019

This is exactly what they do with drugs in preliminary studies, they repeat the experiment 50 times or so and look at the results, just like we look at results of the lottery after the event. An extreme drug outcome would be all lab rats are cured. An extreme lottery outcome would be a single win every single week while K varies from 100 to 1000, which would be fishy as hell. Less extreme lab outcome, 60% of rats cured. Less extreme lottery outcome, you name it. Probability of fraud is a very realistic target, just like probability that the drug is NOT useless.

Jonathan212 · May 29, 2019

Let's say a coin is a magnet and you throw it on a table with a huge but weak magnet underneath whereby the north pole faces upwards. We expect more of one face. 100 tosses with 60 heads is a better result than 10 tosses with 6 heads if you were to bet your money on where the coin's north pole is after the event. Or you don't know if there's any magnet involved, you don't know if there is a fraud. What is the probability of fraud with 6 heads and what is it with 60 heads?

Dale · May 29, 2019

Jonathan212 said:

Probability of fraud is a very realistic target, just like probability that the drug is NOT useless.

That is not the probability that is measured. In medical testing the p value you are talking about is not the probability that the drug has no effect. It is the probability that the data would be that extreme given that the drug has no effect.

If you are familiar with probability notation, a p value gives you P(D|H) which in words is the probability of the data given your hypothesis.

What you are asking about is the opposite. P(H|D) is the probability of the hypothesis given the data. You would need Bayesian methods for that.

Jonathan212 · May 29, 2019

Oopsa, I wrote "NOT useless" but I meant useless for the purposes tested. What's the Bayesian approach to the magnetic coin?

Dale · May 29, 2019

Jonathan212 said:

What's the Bayesian approach to the magnetic coin?

Here is a good tutorial on the topic.

https://www.quantstart.com/articles...a-Binomial-Proportion-The-Analytical-Approach
One important concept in Bayesian statistics is the idea of a prior probability. It is a mathematical expression of your beliefs before looking at the data. So, in this case, do we go in assuming that this coin is probably like most coins or do we come in suspicious that this coin may not be typical?

Whatever our prior beliefs are, we express it as a beta distributed random variable, ##Beta(\alpha,\beta)##. Then, after we do the experiment we update our posterior beliefs as ##Beta(\alpha+n,\beta+m)## where n is the number of heads and m is the number of tails.
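Applied to the magnetic-coin question (6 heads in 10 tosses versus 60 heads in 100), the update could look like the sketch below. The uniform prior Beta(1, 1) is an assumption chosen for illustration, and the function names are made up. For integer parameters the tail probability follows from the standard identity between the Beta and binomial distributions, so no extra library is needed.

```python
from math import comb

def posterior(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + heads, beta + tails)."""
    return alpha + heads, beta + tails

def prob_p_above(a, b, x=0.5):
    """P(p > x) for p ~ Beta(a, b) with integer a, b.

    Uses the identity P(p > x) = P(Binomial(a + b - 1, x) <= a - 1).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a))

for heads, tails in [(6, 4), (60, 40)]:
    a, b = posterior(1, 1, heads, tails)   # Beta(1, 1): uniform prior (an assumption)
    print(heads, tails, a / (a + b), prob_p_above(a, b))
```

Both data sets give a posterior mean near 0.6, but the posterior probability that the coin favours heads is much closer to 1 for 60 out of 100 than for 6 out of 10.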

Jonathan212 · May 30, 2019

If we have no beliefs, no assumptions like "the coin is magnetic" or "it is biased to produce heads 65% of the time", is the bayesian approach hopeless?

PeroK · May 30, 2019

Jonathan212 said:

If we have no beliefs, no assumptions like "the coin is magnetic" or "it is biased to produce heads 65% of the time", is the bayesian approach hopeless?

You could start with the belief that the lottery is a fraud and test that hypothesis. Then you would have to describe the data that would support your hypothesis and test for that. What data would indicate a fraud?

1) Your first claim is that the organisers are paying the winnings to themselves.

There would be no data as such to support this. Instead, you would need to investigate the list of winners and "follow the money" as they say. I suggest you pass any information you have on this to the police in your country.

2) Your second claim is that the organisers are controlling the weeks on which a win takes place. a) that wins are rare on the first week of a cycle; b) that wins are too frequent on the second week of a cycle; c) that wins are too frequent on the third week of a cycle.

This should be easy to test if you have access to the number of tickets bought every week. All you really need to look at is how often a win takes place each week of a cycle and whether this is consistent with the number of tickets bought on those weeks.

Note, however, that as others have said: if you study a set of data looking for any statistical anomalies and then test for those, then that is a meaningless approach. Instead, you should have a good idea of what you want to test before you look at the data.

Jonathan212 · May 30, 2019

"Your first claim is that the organisers are paying the winnings to themselves. There would be no data as such to support this."

Of course there would be data to support the accusation if they paid the winnings to themselves TOO frequently or too anomalously, it probably wouldn't be evidence good enough for court but it would be evidence good enough for us mathematically oriented guys and anyone who would care to check our calculations. Btw, it is extremely easy for the organizers to pay the winnings to themselves if they wanted to, except they wouldn't put it in their... tax return, the money would be won by a thug of theirs and laundered and spent through offshore accounts. The state could easily prevent any fraud simply by forcing them to give the police a complete list of numbers played each week, so a cop could search for the winning numbers after each draw and the organizers would not be able to subsequently add tickets which is what we are accusing them of here.

"Your second claim is that the organizers are controlling the weeks on which a win takes place."

Yes but by paying the winnings to their thugs on those weeks, I don't know why you consider this a separate claim. By the way, wins are too frequent near the 5th draw after a win (2 draws per week as I said initially but let's keep it simple and pretend it's one draw per week, a win every 5 weeks). Such a peak is expected as the amount to be won accumulates and more and more tickets are sold. But it may occur too early if not enough people play and the organizers cheat to prevent demotivation and a collapse of sales.

mfb · May 30, 2019

PeroK said:

Instead, you should have a good idea of what you want to test before you look at the data.

That won't be perfect as the suspicion of fraud comes from that data already. Using future data only would be perfect but that would take a long time.
50 draws on record still leave a lot of room to detect fraud if it is too obvious.

I suggest the following two tests:
- Sum K over all drawings that were the 1st to 3rd drawing after a win. The number of winners in these drawings should follow a Poisson distribution with mean (sum of K)/(50 choose 5). Calculate the probability that there are as many or more winners as observed. The suspected fraud shouldn't influence this number.
- Sum K over all drawings that were the 4th or later drawing after a win. The number of winners should again follow a Poisson distribution. Calculate the probability that there are as many or more winners as observed. The suspected fraud adds winners here.

And one "exploratory" approach: For each time between wins, calculate how many tickets have been sold before someone won. Consider half the tickets for the drawing where someone won. Make a plot of "number of rounds surviving" as function of the number of tickets sold. It should be roughly an exponential distribution. If it deviates too much from that it is suspicious (but not quantified).
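The two Poisson tests could be sketched as follows. The ticket counts and the observed winner count below are hypothetical placeholders, not data from this thread, and the function names are made up for illustration.

```python
from math import exp, lgamma, log

def poisson_pmf(k, lam):
    """P(W = k) for W ~ Poisson(lam), computed in log space to avoid overflow."""
    return exp(-lam + k * log(lam) - lgamma(k + 1))

def p_at_least(observed, lam):
    """P(W >= observed) for W ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(observed))

def expected_winners(tickets_per_draw, combinations):
    """Poisson mean: total tickets sold divided by the number of combinations."""
    return sum(tickets_per_draw) / combinations

# Hypothetical K values for the "4th or later drawing" group, illustration only
late_draw_tickets = [150_000, 180_000, 210_000, 160_000]
lam = expected_winners(late_draw_tickets, 2_118_760)   # 2,118,760 = (50 choose 5)
print(lam, p_at_least(3, lam))   # 3 observed winners is also a made-up figure
```

The subtraction from 1 loses precision when the tail is extremely small; for that case summing the upper tail directly would be the safer choice.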

Dale · May 30, 2019

Jonathan212 said:

If we have no beliefs, no assumptions like "the coin is magnetic" or "it is biased to produce heads 65% of the time", is the bayesian approach hopeless?

Typically you use what is called an uninformative prior. You say something like the coin is biased to produce heads between 0% and 100% of the time, with uniform probability of any value in that range.

Jonathan212 · May 30, 2019

Got myself a year's worth of K's. Surprise, they do not increase as the accumulated money increases, they decrease!

Dale · May 30, 2019

Could the data be sorted backwards from what you expect?

Dale · May 30, 2019

Jonathan212 said:

"Your first claim is that the organisers are paying the winnings to themselves. There would be no data as such to support this."

Of course there would be data to support the accusation if they paid the winnings to themselves TOO frequently or too anomalously, it probably wouldn't be evidence good enough for court but it would be evidence good enough for us mathematically oriented guys and anyone who would care to check our calculations. Btw, it is extremely easy for the organizers to pay the winnings to themselves if they wanted to, except they wouldn't put it in their... tax return, the money would be won by a thug of theirs and laundered and spent through offshore accounts. The state could easily prevent any fraud simply by forcing them to give the police a complete list of numbers played each week, so a cop could search for the winning numbers after each draw and the organizers would not be able to subsequently add tickets which is what we are accusing them of here.

"Your second claim is that the organizers are controlling the weeks on which a win takes place."

Yes but by paying the winnings to their thugs on those weeks, I don't know why you consider this a separate claim. By the way, wins are too frequent near the 5th draw after a win (2 draws per week as I said initially but let's keep it simple and pretend it's one draw per week, a win every 5 weeks). Such a peak is expected as the amount to be won accumulates and more and more tickets are sold. But it may occur too early if not enough people play and the organizers cheat to prevent demotivation and a collapse of sales.

I think that these claims are off topic here. We can talk about the statistics. But evidence for these claims would not be statistical, it would come through forensic accounting and police investigation, neither of which we do here.

Let’s just stick with the statistical modeling here and not discuss fraud. Any further posts regarding fraud will be deleted.

PeroK · May 30, 2019

Dale said:

I think that these claims are off topic here. We can talk about the statistics. But evidence for these claims would not be statistical, it would come through forensic accounting and police investigation, neither of which we do here.

Let’s just stick with the statistical modeling here and not discuss fraud. Any further posts regarding fraud will be deleted.

I'm not sure I'm following all of this, but it's my understanding that many people tend to choose the same sort of numbers in the lottery. Numbers related to birthdays etc. So, you might expect that as more tickets get sold, you get more duplicates and not so many "new" numbers.

You also have to take into account: for small numbers of ticket sales the chance of there being a winning ticket increases approximately in proportion to the number of tickets sold; but, as the number of tickets sold increases the chance of there being a winning ticket increases more slowly. Even without any bias towards certain numbers.
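For the idealized case of K independent, uniformly random tickets and C possible combinations (symbols used here for illustration), this saturation can be written out explicitly:

$$P(\text{at least one winning ticket}) = 1 - \left(1 - \frac{1}{C}\right)^{K} \approx 1 - e^{-K/C}$$

For ##K \ll C## this is approximately ##K/C##, which is the proportional regime. As K approaches and exceeds C the probability flattens out towards 1, because more and more tickets duplicate combinations that are already covered.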

In short, you would need some analysis of the numbers people tend to pick in addition to the total number of ticket sales to calculate the probability of there being a winning ticket on a given week.

Dale · May 30, 2019

PeroK said:

you would need some analysis of the numbers people tend to pick in addition to the total number of ticket sales to calculate the probability of there being a winning ticket on a given week.

I agree. I was trying to convey that point earlier also.

Jonathan212 · May 30, 2019

It's definitely in the right order. It is the whole of year 2018. You shouldn't try to isolate statistics from real life considerations when faced with this table of K values that scream for a psychological explanation: lottery customers are getting de-motivated (K decreases) by default as time passes, even with an increasing prize! And it is only through massive advertising that the game is resuscitated periodically and you get the spikes. Additionally the organizers may legally buy lots of tickets themselves when sales go too low (it costs them nothing) in order to produce winners to show to the media and push the narrative that the high prize made everyone rush to buy tickets.

PeroK · May 30, 2019

Jonathan212 said:

It's definitely in the right order. It is the whole of year 2018. You shouldn't try to isolate statistics from real life considerations when faced with this table of K values that scream for a psychological explanation: lottery customers are getting de-motivated by default as time passes, even with an increasing prize, and it is only through massive advertizing that the game is resuscitated periodically and you get the spikes, plus the organizers may also legally buy lots of tickets themselves (it costs them nothing) to produce winners for the media and push the narrative that the high prize made everyone rush to buy tickets.

What on Earth are you talking about? There is not an iota of mathematics in that post.

Jonathan212 · May 30, 2019

Let's take it step by step. Prize goes up, number of tickets K goes down. Then K suddenly jumps up 300%. Over and over and over. We want to establish from the graph whether this is anomalous statistically and fit mathematical models to theories about its cause.

mfb · May 30, 2019

The correlation between the numbers people pick will increase the variance of the number of winners a bit, but probably not too much (unless we are really unlucky, but outliers can be removed).

Jonathan212 said:

We want to establish from the graph whether it is anomalous statistically and fit mathematical models to it.

Wait... we can't do that for K. It will depend on the prize money, advertisement and many more factors that we can't control. We can only see if the number of winners is realistic given the values of K.

Jonathan212 · May 30, 2019

Shall I graph the number of winners too? It's 1 wherever you see a peak and rarely 2 or more.

mfb · May 30, 2019

Well, without the list of winners we can't determine if there are more or fewer winners than expected, obviously.
A table or other format that is easy to parse would be useful, too.

Jonathan212 · May 31, 2019

"We can only see if the number of winners is realistic given the values of K."

That's very much the gist of it in the end. Let's see. Got 16 years worth of data now, except it's from another lottery where you choose 6 numbers out of 50. How do we use the attached table to detect the specific fraud where the organizers add a winning ticket after the draw?

Jonathan212 · May 31, 2019

It may be simple: out of 1707 draws, 1356 draws produced no winner. That's no winner 79.4% of the time. Was a higher percentage expected given the average K of 6,126,358 and given (50 choose 6) = 15,890,700?

PeroK · May 31, 2019

Jonathan212 said:

It may be simple: out of 1707 draws, 1356 draws produced no winner. That's no winner 79.4% of the time. Was a higher percentage expected given the average K of 6,126,358 and given (50 choose 6) = 15,890,700?

If there is a winner about 20% of the time, then that implies that on average about 20% of the possible sets of numbers are covered. That's about 3.2 million different combinations.

Your figures suggest, therefore, that although 6.1 million tickets are sold, they represent only about 3.2 million combinations. I read yesterday that about 10,000 people play 1, 2, 3, 4, 5, 6 every week, for example. In any case, that would be the likely explanation. With 6 million random tickets I would expect about 5 million different combinations (rough guess). So, these figures are consistent with the hypothesis that players do not choose at random but typically favour certain types of combination.
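The rough guess of about 5 million different combinations can be checked with a short calculation. This is a sketch under the assumption that every ticket is an independent, uniformly random pick and that every draw sells the average K quoted in the thread; the variable names are illustrative.

```python
from math import exp

COMBINATIONS = 15_890_700   # (50 choose 6), as quoted in the thread
AVG_TICKETS = 6_126_358     # average K per draw, as quoted in the thread

load = AVG_TICKETS / COMBINATIONS             # expected tickets per combination
p_no_winner = exp(-load)                      # Poisson approximation to P(no winner)
distinct = COMBINATIONS * (1 - p_no_winner)   # expected number of distinct combinations

print(round(p_no_winner, 3))
print(round(distinct / 1e6, 2))   # in millions
```

This gives a no-winner share of about 68% and about 5.1 million distinct combinations, against the observed 79.4% of draws without a winner. Since K actually varies from draw to draw, using the average K somewhat understates the expected no-winner share, so the gap is probably smaller than these two numbers suggest.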

The only way to verify this, of course, is to obtain figures for the number of combinations typically chosen on a weekly basis.

Note that with these figures, you will have to change your accusation to one where the operators suppress wins - there is no evidence here of excessive wins. It's how few wins there are given the ticket sales that needs to be explained.

PS the above data is consistent with there being an average of 2 winners each time the lottery is won. I.e. as there are 351 draws when there was a winner there should be about 700 winners in total. Is that data available?

Jonathan212 · May 31, 2019

If you import the above text file to Excel and do the average of W when W > 0, it's 1.32. Not sure why you want that. The number of 1-winner draws is 280, the number of 2-winner draws is 55 etc. It's all in the summary at the beginning and the raw data is further down.

PeroK · May 31, 2019

Jonathan212 said:

If you import the above text file to Excel and do the average of W when W > 0, it's 1.32. Not sure why you want that. The number of 1-winner draws is 280, the number of 2-winner draws is 55 etc. It's all in the summary at the beginning and the raw data is further down.

There are fewer than 500 winners. That suggests that there may be certain combinations - possibly a relatively small number - with a lot of tickets. And that none of these tickets has won yet. At some time, however, one of these tickets will win and create a large number of winners that week. This would bring the average back towards 2 per win.

There may be another explanation. But, if there really are 10,000 people playing 1, 2, 3, 4, 5, 6 every week, then this is a possible explanation.

Jonathan212 · May 31, 2019

If we want to assess a single draw, how extreme a single draw is, given K for this draw, what's the proper way to do it?

(50 choose 6)/K must be ok as a factor for small K's, but it can't be right for K=(50 choose 6) even if people choose with a random number generator because even the random number generator will produce duplicates.

PeroK · May 31, 2019

Jonathan212 said:

If we want to assess a single draw, how extreme a single draw is, given K for this draw, what's the proper way to do it?

(50 choose 6)/K must be ok as a factor for small K's, but it can't be right for K=(50 choose 6) even if people choose with a random number generator because even the random number generator will produce duplicates.

If you only know how many tickets have been sold, but not how widely the tickets are distributed, then there is no way to predict the frequency of a lottery win. But, the total number of winners -over a potentially long time - should be more predictable.

Take an example of a lottery with 100 tickets and 50 players. If, for whatever reason, they all have different numbers, then you'll get one win every two weeks on average; and, only ever one winner.

At the other extreme, if they all have the same number, then you will only get one win every 100 weeks, but 50 winners every time.

And, if there is something between the two, with perhaps 40 different numbers, then you will get a win less than once every two weeks but sometimes more than one winner.

The common factor is the total number of winners, which relates only to the total number of tickets sold.
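The three cases of this toy lottery can be worked out exactly. In the sketch below, the split of 50 players over 40 numbers (10 numbers held twice, 30 held once) is a hypothetical example, and the function name is made up for illustration.

```python
def lottery_summary(holders_per_number, n_numbers=100):
    """holders_per_number: how many players hold each distinct number in play."""
    distinct = len(holders_per_number)
    players = sum(holders_per_number)
    p_win = distinct / n_numbers              # chance that a draw has any winner
    winners_per_draw = players / n_numbers    # expected winners per draw
    winners_per_win = players / distinct      # average winners in a draw that is won
    return p_win, winners_per_draw, winners_per_win

all_different = lottery_summary([1] * 50)               # 50 players, 50 numbers
all_same = lottery_summary([50])                        # 50 players, 1 number
forty_distinct = lottery_summary([2] * 10 + [1] * 30)   # hypothetical split

print(all_different)    # (0.5, 0.5, 1.0)
print(all_same)         # (0.01, 0.5, 50.0)
print(forty_distinct)   # (0.4, 0.5, 1.25)
```

The middle value, expected winners per draw, is 0.5 in all three cases: it depends only on the number of tickets, while the frequency of winning draws depends on how the tickets are spread.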

In the real lottery, out of 6.1 million tickets sold, you might have only 3.2 million different numbers. Most of these would be held by only a few players: perhaps 1-5. But, some special "lucky" numbers might be held by thousands of different players. This could result in the pattern from your data. Most weeks there is a small number of winners, but if the lottery is played long enough, eventually one of the commonly held numbers will turn up and you'll get hundreds or thousands of winners.

In this case, it may take a long time for the number of winners to average out to match the ticket sales.

In the meantime, there is no definite, immediate way to know for sure why there are so few winners - given the number of ticket sales.

Jonathan212 · May 31, 2019

"there is no definite, immediate way to know for sure why there are so few winners - given the number of ticket sales."

Alright, I'm with you on this one. Going back to your simplified lottery, the extremes are

1. a win every 2 weeks
2. 50 wins every 100 weeks

So if we observe a win every single week, that's outside the above range and an anomaly, right? An extreme like the drug extremes previously mentioned. Can't we assign it a number like "p<0.001"?

PeroK · May 31, 2019

Jonathan212 said:

"there is no definite, immediate way to know for sure why there are so few winners - given the number of ticket sales."

Alright, I'm with you on this one. Going back to your simplified lottery, the extremes are

1. a win every 2 weeks
2. 50 wins every 100 weeks

So if we observe a win every single week, that's outside the above range and an anomaly, right? An extreme like the drug extremes previously mentioned. Can't we assign it a number like "p<0.001"?

If you had a win every week, then over time your confidence that the lottery was properly administered would reduce.

You're confusing probabilities with confidences.

Jonathan212 · May 31, 2019

Could go in the opposite direction. Assume numbers 1-30 are f times more popular than the rest and calculate f from the observations of W versus K, starting with W = 0.

mfb · Jun 1, 2019

If I interpret the txt right we had 10,457,692,468 tickets sold (that is a lot!) and 465 winners. At 1 in 15,890,700 we would expect 658 winners. To explain this difference with random chance we need a significant share of tickets going to a very small share of combinations. The 8 winners with the very small number of tickets sold (5.8 million) points in this direction, although I would (without calculating) expect more outliers.

Jonathan212 · Jun 2, 2019

"The 8 winners with the very small number of tickets sold (5.8 million)"

Here's the winning numbers at that draw, played in 8 different tickets.

34 27 13 17 6 13

Surprise, it can't be birthday numbers. It's as if someone knew what would happen and bought the same combination 8 times to ensure he wouldn't have to share too much of the prize.

Jonathan212 · Jun 2, 2019

What is the statistical significance of 465 instead of 658 winners? I think that is:

P( number of winners <= 465 | all numbers are equally popular )
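A minimal sketch of that calculation, using the totals quoted earlier in the thread and assuming the total number of winners is Poisson distributed when all combinations are equally popular (the variable and function names are illustrative):

```python
from math import exp, lgamma, log

TICKETS = 10_457_692_468    # total tickets sold, as read from the data file
COMBINATIONS = 15_890_700   # (50 choose 6)
OBSERVED = 465              # total winners, as read from the data file

# Expected number of winners if every combination is equally likely to be played
lam = TICKETS / COMBINATIONS

def poisson_pmf(k, lam):
    """P(W = k) for W ~ Poisson(lam), in log space because lam**k overflows a float."""
    return exp(-lam + k * log(lam) - lgamma(k + 1))

# P(number of winners <= 465 | all numbers are equally popular)
p_value = sum(poisson_pmf(k, lam) for k in range(OBSERVED + 1))
print(round(lam, 1), p_value)
```

The result is many orders of magnitude below 0.001. That is the probability of the data under the equal-popularity model, not the probability that something improper happened.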

PeroK · Jun 2, 2019

Jonathan212 said:

What is the statistical significance of 465 instead of 658 winners? I think that is:

P( number of winners <= 465 | all numbers are equally popular )

If the hypothesis is that ticket numbers were chosen at random (or equally popular), then that hypothesis would be false with almost 100% confidence. The calculated probability above would be close to zero.

But, we already know that numbers are chosen by people with certain biases. The data, from that point of view, tells us nothing. We would need many more weeks (millions perhaps) to see the full picture.

If you knew the distribution of numbers chosen each week, then you could test the hypothesis that the lottery is fair. Or, you could wait a few hundred million weeks or so.

Jonathan212 · Jun 2, 2019

"Or, you could wait a few hundred million weeks"

But whence that figure of a few hundred million?

PeroK · Jun 2, 2019

Jonathan212 said:

"Or, you could wait a few hundred million weeks"

But whence that figure of a few hundred million?

There are 15 million possible numbers. If a small number are very popular, let's say 10, then one of these most popular numbers comes up only once every 1.5 million weeks.

If, for example, about 10,000 people choose 1, 2, 3, 4, 5, 6 every week, then either you look at the numbers chosen to see this; or, you run the lottery millions of times until this combination comes up and you get the data via the 10,000 winners that week.
