Mean time between lottery wins and probability of fraud by organizers

Jonathan212 · May 28, 2019

Looked at some lottery wins and something was fishy. This is a lottery where you pick 5 numbers out of the set (1, 2, ..., 50). When no one wins, the money goes to the next iteration of the game, so the prize gets bigger and bigger. Wins seemed too regular, occurring about every 2 or 3 weeks and never in consecutive draws. As if people were waiting for the money to accumulate, which is probably true, but there is another possibility, and that is fraud by the organizers: too few people play, no wins occur at all for ages, and because this demotivates players and could reduce sales to the point of collapse, the organizers cheat and win the prize themselves periodically. How would you investigate this mathematically based on the observed distribution of time between wins? What is the expected mean and standard deviation of the time between (full-match) wins, given the number of 5-number sets played in each iteration, K = 100,000, and iterations per week = 2?

The observed time between wins seems to have a sharp distribution with a mean around 2.5 weeks and this is fishy. Based on an observation like this, what is the probability that the organizers cheat?

What should K be to match the observed mean time between wins?

Dale · May 28, 2019

Jonathan212 said:

What is the expected mean and standard deviation of the time between (full-match) wins?

I don’t think the answer can be determined without knowing the number of tickets sold. If K fluctuates then it would be complicated.

Jonathan212 · May 28, 2019

Sure, just pretending we know K and it is fixed, just to derive the formula. Then stick in some real numbers. Or start from the 2.5 weeks and derive a fixed K, then work out standard deviation around the 2.5 weeks and compare this with observed deviation.

mfb · May 28, 2019

There are (50 choose 5) options to pick lottery numbers, so on average you expect a jackpot winner every (50 choose 5)/K draws, where K is the number of tickets per draw (shared jackpots count as multiple winners). Plug in numbers and see if they look realistic. The assumption of a constant number of players is very unrealistic, however. Larger jackpots attract more players.
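
As a rough sketch of "plug in numbers", the snippet below evaluates (50 choose 5) and the resulting average gap between winning tickets. It uses K = 100,000 tickets per draw and 2 draws per week from the opening post, and assumes K is constant and picks are uniformly random. Since K counts tickets per draw, the ratio is a number of draws. The variable names are illustrative only.

```python
from math import comb

N_COMBINATIONS = comb(50, 5)      # number of distinct 5-number tickets
K = 100_000                       # tickets per draw (from the opening post)
DRAWS_PER_WEEK = 2                # from the opening post

# Average number of draws per winning ticket, assuming constant K
mean_draws = N_COMBINATIONS / K
mean_weeks = mean_draws / DRAWS_PER_WEEK

# K that would give one winning ticket per 2.5 weeks on average
target_weeks = 2.5
k_needed = N_COMBINATIONS / (target_weeks * DRAWS_PER_WEEK)

print(f"(50 choose 5) = {N_COMBINATIONS}")
print(f"mean gap: {mean_draws:.1f} draws = {mean_weeks:.1f} weeks")
print(f"K needed for a 2.5-week mean gap: {k_needed:.0f}")
```

Under these assumptions a 2.5-week average gap would need roughly four times as many tickets per draw as the K quoted in the opening post.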

Sometimes so "many players" that they change the lottery (long article, skip to the first mention of "Winfall" for the lottery part).

Jonathan212 · May 28, 2019

What's (50 choose 5) short for?

mfb · May 28, 2019

The number of ways to pick 5 elements out of 50 where the order doesn't matter. Wikipedia has the formulas.

Dale · May 28, 2019

You can use the binomial distribution to calculate the number of likely winners each week. For that you need the number of players (K) and the probability of each player winning (1/[50 choose 5]). Then the probability of at least 1 person winning is 1 - the probability of 0 people winning. That latter probability is exactly what the binomial distribution gives.

That gives you the probability of a winner each draw, call that P. The probability that the first win occurs on the Nth draw is then ##(1-P)^{N-1}P##

However, if you detect a significant discrepancy from this model then it is not an indication of fraud. This model assumes that K is constant and that the players' number selections are independent. A violation of either assumption is not an indication of fraud.
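
A minimal sketch of this waiting-time model, assuming a constant K and independent, uniformly random picks (the assumptions stated in this post). The waiting time in draws is then geometric, with mean ##1/P## and standard deviation ##\sqrt{1-P}/P##. The function names are made up for illustration.

```python
from math import comb, sqrt

def win_probability(k_tickets, n_combinations=comb(50, 5)):
    """P(at least one winning ticket in a draw) = 1 - P(zero winners)."""
    return 1.0 - (1.0 - 1.0 / n_combinations) ** k_tickets

def prob_first_win_on_draw(n, p):
    """Geometric distribution: n - 1 draws without a winner, then a win."""
    return (1.0 - p) ** (n - 1) * p

def waiting_time_stats(p):
    """Mean and standard deviation of the geometric waiting time, in draws."""
    return 1.0 / p, sqrt(1.0 - p) / p

p = win_probability(100_000)   # K from the opening post
mean_draws, sd_draws = waiting_time_stats(p)
print(f"P = {p:.4f}, mean = {mean_draws:.1f} draws, sd = {sd_draws:.1f} draws")
```

For small P the standard deviation is almost as large as the mean, so under this model the gaps between wins should be widely spread rather than sharply peaked.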

Jonathan212 · May 28, 2019

How many values of K and results do you need to calculate probability of fraud with decent accuracy?

mfb · May 28, 2019

Do you know how many people play each time?

Your question is too broad to answer.

Dale · May 28, 2019

Jonathan212 said:

How many values of K and results do you need to calculate probability of fraud with decent accuracy?

I don’t think that is possible. What you can calculate is the probability of the observed periods between wins according to the model that K is constant and all lottery picks are random. That will undoubtedly be some very low probability. But as we mentioned above there are many ways that model could be wrong besides fraud.

Jonathan212 · May 29, 2019

K is not constant we said. It is given for each week. 100,123 the first week, 192,321 the second week, 255,233 the third week etc. Just 3 weeks is way too small a sample. But a year's worth, maybe. Presented with 50 values of K and 50 values of results (=number of winners this week), what is the probability that this happened by chance? If all K's were of the order of 100-1000, the probability that a win occurs every week is close to 0. A win once a year is probably close to 0 too. But with higher K's we need a mathematician.

Reminds me of drug testing against placebos. Someone must have heard of "statistical significance". It's figures like "< 0.001". Looks like a probability.

mfb · May 29, 2019

Jonathan212 said:

Presented with 50 values of K and 50 values of results (=number of winners this week), what is the probability that this happened by chance?

No matter what K is it will be tiny because it is not the right question.
An equivalent question: Given a sequence of 20 coin tosses (HHTHTHTTHHHTTHTHHTHH), what is the probability that this happened? 1/2²⁰ or about 1 in a million. Should we be surprised by this particular result? No. All ~1 million possible sequences have this probability and the one I selected is nothing special.

What you need is the probability "this result or more extreme" where "more extreme" is to be defined. In the coin toss example you could ask "I got 15 times heads, how likely is it that I get 15 or more times the same result?" For the lottery you could consider the total number of wins: How many do you expect given the number of players, how many times did someone win? How likely is it to get so few wins or even fewer? So many or even more?
If you want to look for a pattern of "there is a winner if the jackpot is high" then it gets more complicated to define what we are looking for. This has to be done before analyzing the actual results, otherwise you might bias yourself by selecting a question specifically to find something unusual.
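
To make the coin example concrete, here is a small sketch that compares the probability of one specific 20-toss sequence with the probability of "15 or more times the same result". It assumes a fair coin and independent tosses; the function name is illustrative.

```python
from math import comb

def prob_at_least_same_side(n_tosses, threshold):
    """P(at least `threshold` heads or at least `threshold` tails), fair coin.

    Assumes threshold > n_tosses / 2, so the two tails cannot overlap.
    """
    one_tail = sum(comb(n_tosses, k) for k in range(threshold, n_tosses + 1))
    return 2 * one_tail / 2 ** n_tosses

p_sequence = 1 / 2 ** 20                       # one particular sequence
p_extreme = prob_at_least_same_side(20, 15)    # "this result or more extreme"

print(f"specific sequence: {p_sequence:.2e}")
print(f"15 or more of the same side: {p_extreme:.4f}")
```

The second number is about 0.041, which is the kind of figure a significance test reports; the first is tiny for every sequence and says nothing on its own.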

Dale · May 29, 2019

Jonathan212 said:

Looks like a probability.

It is a probability. It is the probability that data this “extreme” occurred by chance given your data generating model. This is not the same as the probability that fraud was committed.

Jonathan212 · May 29, 2019

This is exactly what they do with drugs in preliminary studies, they repeat the experiment 50 times or so and look at the results, just like we look at results of the lottery after the event. An extreme drug outcome would be all lab rats are cured. An extreme lottery outcome would be a single win every single week while K varies from 100 to 1000, which would be fishy as hell. Less extreme lab outcome, 60% of rats cured. Less extreme lottery outcome, you name it. Probability of fraud is a very realistic target, just like probability that the drug is NOT useless.

Jonathan212 · May 29, 2019

Let's say a coin is a magnet and you throw it on a table with a huge but weak magnet underneath whereby the north pole faces upwards. We expect more of one face. 100 tosses with 60 heads is a better result than 10 tosses with 6 heads if you were to bet your money on where the coin's north pole is after the event. Or you don't know if there's any magnet involved, you don't know if there is a fraud. What is the probability of fraud with 6 heads and what is it with 60 heads?

Dale · May 29, 2019

Jonathan212 said:

Probability of fraud is a very realistic target, just like probability that the drug is NOT useless.

That is not the probability that is measured. In medical testing the p value you are talking about is not the probability that the drug has no effect. It is the probability that the data would be that extreme given that the drug has no effect.

If you are familiar with probability notation, a p value gives you P(D|H) which in words is the probability of the data given your hypothesis.

What you are asking about is the opposite. P(H|D) is the probability of the hypothesis given the data. You would need Bayesian methods for that.

Jonathan212 · May 29, 2019

Oopsa, I wrote "NOT useless" but I meant useless for the purposes tested. What's the Bayesian approach to the magnetic coin?

Dale · May 29, 2019

Jonathan212 said:

What's the Bayesian approach to the magnetic coin?

Here is a good tutorial on the topic.

https://www.quantstart.com/articles...a-Binomial-Proportion-The-Analytical-Approach
One important concept in Bayesian statistics is the idea of a prior probability. It is a mathematical expression of your beliefs before looking at the data. So, in this case, do we go in assuming that this coin is probably like most coins or do we come in suspicious that this coin may not be typical?

Whatever our prior beliefs are, we express them as a beta distributed random variable, ##Beta(\alpha,\beta)##. Then, after we do the experiment, we update to our posterior beliefs, ##Beta(\alpha+n,\beta+m)##, where n is the number of heads and m is the number of tails.
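
A sketch of that update for the two cases asked about earlier (6 heads in 10 tosses, 60 heads in 100 tosses). It assumes a uniform prior, ##Beta(1,1)##, which is only one possible choice, and it estimates the posterior probability that the heads rate exceeds 0.5 by simple midpoint-rule integration. The helper names are invented for this example.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at 0 < x < 1, computed in log space."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

def prob_above(threshold, a, b, steps=20_000):
    """P(p > threshold) under Beta(a, b), by midpoint-rule integration."""
    width = (1.0 - threshold) / steps
    total = sum(beta_pdf(threshold + (i + 0.5) * width, a, b)
                for i in range(steps))
    return total * width

def update(prior, heads, tails):
    """Beta(alpha, beta) prior -> Beta(alpha + heads, beta + tails) posterior."""
    a, b = prior
    return a + heads, b + tails

prior = (1, 1)  # uniform prior: an assumption, not the only option
for heads, tails in [(6, 4), (60, 40)]:
    a, b = update(prior, heads, tails)
    print(f"{heads + tails} tosses: posterior Beta({a}, {b}), "
          f"mean {a / (a + b):.3f}, P(p > 0.5) = {prob_above(0.5, a, b):.3f}")
```

Both data sets show 60% heads, but the posterior after 100 tosses puts much more weight on a heads rate above 0.5 than the posterior after 10 tosses.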

Jonathan212 · May 30, 2019

If we have no beliefs, no assumptions like "the coin is magnetic" or "it is biased to produce heads 65% of the time", is the bayesian approach hopeless?

PeroK · May 30, 2019

Jonathan212 said:

If we have no beliefs, no assumptions like "the coin is magnetic" or "it is biased to produce heads 65% of the time", is the bayesian approach hopeless?

You could start with the belief that the lottery is a fraud and test that hypothesis. Then you would have to describe the data that would support your hypothesis and test for that. What data would indicate a fraud?

1) Your first claim is that the organisers are paying the winnings to themselves.

There would be no data as such to support this. Instead, you would need to investigate the list of winners and "follow the money" as they say. I suggest you pass any information you have on this to the police in your country.

2) Your second claim is that the organisers are controlling the weeks on which a win takes place. a) that wins are rare on the first week of a cycle; b) that wins are too frequent on the second week of a cycle; c) that wins are too frequent on the third week of a cycle.

This should be easy to test if you have access to the number of tickets bought every week. All you really need to look at is how often a win takes place each week of a cycle and whether this is consistent with the number of tickets bought on those weeks.

Note, however, that as others have said: if you study a set of data looking for any statistical anomalies and then test for those, then that is a meaningless approach. Instead, you should have a good idea of what you want to test before you look at the data.

Jonathan212 · May 30, 2019

"Your first claim is that the organisers are paying the winnings to themselves. There would be no data as such to support this."

Of course there would be data to support the accusation if they paid the winnings to themselves TOO frequently or too anomalously, it probably wouldn't be evidence good enough for court but it would be evidence good enough for us mathematically oriented guys and anyone who would care to check our calculations. Btw, it is extremely easy for the organizers to pay the winnings to themselves if they wanted to, except they wouldn't put it in their... tax return, the money would be won by a thug of theirs and laundered and spent through offshore accounts. The state could easily prevent any fraud simply by forcing them to give the police a complete list of numbers played each week, so a cop could search for the winning numbers after each draw and the organizers would not be able to subsequently add tickets which is what we are accusing them of here.

"Your second claim is that the organizers are controlling the weeks on which a win takes place."

Yes but by paying the winnings to their thugs on those weeks, I don't know why you consider this a separate claim. By the way, wins are too frequent near the 5th draw after a win (2 draws per week as I said initially but let's keep it simple and pretend it's one draw per week, a win every 5 weeks). Such a peak is expected as the amount to be won accumulates and more and more tickets are sold. But it may occur too early if not enough people play and the organizers cheat to prevent demotivation and a collapse of sales.

mfb · May 30, 2019

PeroK said:

Instead, you should have a good idea of what you want to test before you look at the data.

That won't be perfect, as the suspicion of fraud already comes from that data. Using future data only would be perfect but that would take a long time.
50 draws on record still leave a lot of room to detect fraud if it is too obvious.

I suggest the following two tests:
- Sum all K over the drawings that are drawing 1 to 3 after a win. The number of winners should follow a Poisson distribution. Calculate the probability that there are as many or more winners as observed. The suspected fraud shouldn't influence this number.
- Sum all K over the drawings that are drawing 4 or later after a win. The number of winners should follow a Poisson distribution. Calculate the probability that there are as many or more winners as observed. The suspected fraud adds winners here.
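
One way either test could be sketched: the expected number of winning tickets is the summed K divided by (50 choose 5), and the p-value is the Poisson probability of seeing at least the observed number of winners. The ticket counts and the observed win count below are hypothetical placeholders, not data from this thread, and the function names are illustrative.

```python
from math import comb, exp

N_COMBINATIONS = comb(50, 5)

def poisson_tail(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam)."""
    term = exp(-lam)          # P(X = 0)
    below = 0.0               # accumulates P(X < observed)
    for k in range(observed):
        below += term
        term *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return 1.0 - below

def excess_win_p_value(tickets_per_draw, observed_wins):
    """Expected winners and P(as many or more winners than observed)."""
    lam = sum(tickets_per_draw) / N_COMBINATIONS
    return lam, poisson_tail(observed_wins, lam)

# Hypothetical example: 20 draws in the "drawing 4 or later" group
late_draw_tickets = [150_000] * 20
lam, p_value = excess_win_p_value(late_draw_tickets, observed_wins=5)
print(f"expected winners: {lam:.2f}, P(X >= 5) = {p_value:.4f}")
```

The same function applies to the first group; only the list of K values and the observed count change.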

And one "exploratory" approach: For each time between wins, calculate how many tickets have been sold before someone won. Consider half the tickets for the drawing where someone won. Make a plot of "number of rounds surviving" as a function of the number of tickets sold. It should be roughly an exponential distribution. If it deviates too much from that it is suspicious (but not quantified).
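
A possible sketch of that exploratory check, printing a few points of the survival curve instead of plotting it. The ticket counts and win flags are made-up placeholders. Restarting the ticket count from zero after each win is an assumption of this sketch, as are the function names.

```python
from math import comb, exp

N_COMBINATIONS = comb(50, 5)

def tickets_until_each_win(tickets_per_draw, had_winner):
    """Tickets sold between consecutive wins, counting half of the winning draw."""
    gaps, running = [], 0.0
    for k, won in zip(tickets_per_draw, had_winner):
        if won:
            gaps.append(running + k / 2)
            running = 0.0      # assumption: the count restarts after a win
        else:
            running += k
    return gaps

def empirical_survival(gaps, t):
    """Fraction of win-free runs that lasted beyond t tickets."""
    return sum(g > t for g in gaps) / len(gaps)

def model_survival(t):
    """Exponential approximation to P(no winning ticket among t random tickets)."""
    return exp(-t / N_COMBINATIONS)

# Hypothetical data: tickets sold per draw and whether the draw had a winner
tickets = [100_000, 120_000, 150_000, 200_000, 90_000, 110_000, 180_000]
winner = [False, False, False, True, False, False, True]

gaps = tickets_until_each_win(tickets, winner)
for t in (100_000, 300_000, 500_000):
    print(t, empirical_survival(gaps, t), round(model_survival(t), 3))
```

With a year of real data the two columns can be compared by eye; a large, systematic gap between them is the "suspicious but not quantified" signal described above.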

Dale · May 30, 2019

Jonathan212 said:

If we have no beliefs, no assumptions like "the coin is magnetic" or "it is biased to produce heads 65% of the time", is the bayesian approach hopeless?

Typically you use what is called an uninformative prior. You say something like the coin is biased to produce heads between 0% and 100% of the time, with uniform probability for any value in that range.

Jonathan212 · May 30, 2019

Got myself a year's worth of K's. Surprise, they do not increase as the accumulated money increases, they decrease!

Dale · May 30, 2019

Could the data be sorted backwards from what you expect?

Dale · May 30, 2019

Jonathan212 said:

"Your first claim is that the organisers are paying the winnings to themselves. There would be no data as such to support this."

Of course there would be data to support the accusation if they paid the winnings to themselves TOO frequently or too anomalously, it probably wouldn't be evidence good enough for court but it would be evidence good enough for us mathematically oriented guys and anyone who would care to check our calculations. Btw, it is extremely easy for the organizers to pay the winnings to themselves if they wanted to, except they wouldn't put it in their... tax return, the money would be won by a thug of theirs and laundered and spent through offshore accounts. The state could easily prevent any fraud simply by forcing them to give the police a complete list of numbers played each week, so a cop could search for the winning numbers after each draw and the organizers would not be able to subsequently add tickets which is what we are accusing them of here.

"Your second claim is that the organizers are controlling the weeks on which a win takes place."

Yes but by paying the winnings to their thugs on those weeks, I don't know why you consider this a separate claim. By the way, wins are too frequent near the 5th draw after a win (2 draws per week as I said initially but let's keep it simple and pretend it's one draw per week, a win every 5 weeks). Such a peak is expected as the amount to be won accumulates and more and more tickets are sold. But it may occur too early if not enough people play and the organizers cheat to prevent demotivation and a collapse of sales.

I think that these claims are off topic here. We can talk about the statistics. But evidence for these claims would not be statistical, it would come through forensic accounting and police investigation, neither of which we do here.

Let’s just stick with the statistical modeling here and not discuss fraud. Any further posts regarding fraud will be deleted.

PeroK · May 30, 2019

Dale said:

I think that these claims are off topic here. We can talk about the statistics. But evidence for these claims would not be statistical, it would come through forensic accounting and police investigation, neither of which we do here.

Let’s just stick with the statistical modeling here and not discuss fraud. Any further posts regarding fraud will be deleted.

I'm not sure I'm following all of this, but it's my understanding that many people tend to choose the same sort of numbers in the lottery. Numbers related to birthdays etc. So, you might expect that as more tickets get sold, you get more duplicates and not so many "new" numbers.

You also have to take into account: for small numbers of ticket sales the chance of there being a winning ticket increases approximately in proportion to the number of tickets sold; but, as the number of tickets sold increases the chance of there being a winning ticket increases more slowly. Even without any bias towards certain numbers.
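
The second effect can be illustrated with a short sketch that compares the proportional estimate K/(50 choose 5) with the probability of at least one winning ticket, assuming every ticket is an independent, uniformly random pick (so no bias towards certain numbers). The values of K are arbitrary examples and the function names are illustrative.

```python
from math import comb

N_COMBINATIONS = comb(50, 5)

def p_proportional(k_tickets):
    """Linear estimate: the chance grows in proportion to tickets sold."""
    return k_tickets / N_COMBINATIONS

def p_at_least_one_winner(k_tickets):
    """Probability that at least one of k independent random tickets wins."""
    return 1.0 - (1.0 - 1.0 / N_COMBINATIONS) ** k_tickets

for k in (10_000, 100_000, 1_000_000, 5_000_000):
    print(f"K = {k:>9}: proportional {p_proportional(k):.4f}, "
          f"at least one winner {p_at_least_one_winner(k):.4f}")
```

For small K the two columns nearly coincide; for large K the proportional estimate exceeds 1 while the true probability only approaches it.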

In short, you would need some analysis of the numbers people tend to pick in addition to the total number of ticket sales to calculate the probability of there being a winning ticket on a given week.

Dale · May 30, 2019

PeroK said:

you would need some analysis of the numbers people tend to pick in addition to the total number of ticket sales to calculate the probability of there being a winning ticket on a given week.

I agree. I was trying to convey that point earlier also.

Jonathan212 · May 30, 2019

It's definitely in the right order. It is the whole of year 2018. You shouldn't try to isolate statistics from real life considerations when faced with this table of K values that scream for a psychological explanation: lottery customers are getting de-motivated (K decreases) by default as time passes, even with an increasing prize! And it is only through massive advertising that the game is resuscitated periodically and you get the spikes. Additionally the organizers may legally buy lots of tickets themselves when sales go too low (it costs them nothing) in order to produce winners to show to the media and push the narrative that the high prize made everyone rush to buy tickets.

PeroK · May 30, 2019

Jonathan212 said:

It's definitely in the right order. It is the whole of year 2018. You shouldn't try to isolate statistics from real life considerations when faced with this table of K values that scream for a psychological explanation: lottery customers are getting de-motivated by default as time passes, even with an increasing prize, and it is only through massive advertizing that the game is resuscitated periodically and you get the spikes, plus the organizers may also legally buy lots of tickets themselves (it costs them nothing) to produce winners for the media and push the narrative that the high prize made everyone rush to buy tickets.

What on Earth are you talking about? There is not an iota of mathematics in that post.
