# Micromass' big statistics challenge

• Challenge
• micromass
In summary, we discussed various open-ended questions related to probability theory and statistics. Our goal was to find strategies and models to answer these questions and provide reasoning for their plausibility. We also allowed the use of outside sources, as long as they were properly referenced. Some of the questions we covered include estimating the number of fish in a lake, identifying a real coin toss experiment from a human-invented one, and determining the optimal number of seats on a train based on daily ridership. We also examined a scenario involving a person playing a game with a psychic, and discussed strategies for distinguishing between randomly generated text and real text.
Number Nine said:
Another way of doing #2

If the sequence was generated by a coin toss, then it is ##Binomial(199,0.5)##. Since each flip is independent, the probability that flip ##n+1## is identical to flip ##n## is ##0.5##, so the number of repetitions is ##Binomial(198,0.5)##.

Let ##k## be the number of repetitions. We want to estimate ##p = 0.5##.
The first sequence contains 117 repetitions, which gives (under a uniform prior) a posterior distribution of ##Beta(1 + 117, 198 - 117 + 1)##, which is greater than ##.5## with probability ##>0.99##.

The second contains 96 repetitions, which gives (under a uniform prior) a posterior distribution of ##Beta(1 + 96, 198 - 96 + 1)##, which is compatible with ##p = 0.5##.
This is compatible with my ##χ^2## Test (posts #2 and #9). However, it seems to lead to the wrong answer.

I posted about #2 in posts 49, 52, and 59, but I didn't realize each sequence is 200 characters. I only analyzed the first 100 because the second 100 in each sequence requires moving the bottom across. I counted "changes of state" in post 59 which is essentially ## k=( 199- ## finding it the same) that @Number Nine did in post 66, (and is also binomial). In any case, using Number Nine's counting, there are 82 "changes of state' and ## \sigma^2=Npq=199(1/2)(1/2)=50 ##. This gives ## \sigma=7.1 ## and ## z=(100-82)/7.1=2.6 ##. (The binomial statistics approach the gaussian for the z values for large N. The gaussian tells us a z of 2.6 is rather unlikely.)

Last edited:
Here is my statistical analysis for the coin tosses (problem 2). Both sequences have a length of 199. Real coin tosses are independent, we expect:

- about the same number of H and T. The first has 91 T, the second has 94, we expect 99.5, both are reasonable.
- about 50% probability that two subsequent tosses have the same result. The first sequence has this 117 out of 198 times, the second one 95 times out of 198. That 117 is a bit high, let's check further.
- sequences TTT or HHH should occur on average 197/4=49.25 times. The first sequence has this 63 times, the second one 45 times. First is a bit high here as well, but the numbers are strongly correlated.
- sequences TTTT or HHHH? We expect 24.5, we get 35 and 12 respectively. Both are off.
- T^5 or H^5? We expect 12.2, we get 20 and 1. The probability to have zero or one sequence of that length is 9.8%. The probability to have 20 or more such sequences is 16.5%.

Okay, weird. My initial impression that the second sequence is lacking runs was right, but the other one has too many, both with still somewhat reasonable probabilities.

TxT or HxH? No anomaly. TxxT/HxxH? Also no.
THT or HTH? We expect 49.25, we have 27 and 53.
THTH or HTHT? We expect 24.5, the first sequence has 8 while the second has 27.
THTHT/HTHTH? We expect 12.2, we have 4 and 13, respectively.
6 alternating? We expect 6.1, we get 1 and 6.
HTTH or THHT? We expect 24.5, we get 26 and 17.

In principle, both sequences can be generated by a human, and both can be generated randomly, so we can never be sure. Looking at the various tests described here, sequence 1 looks more odd than sequence 2. Humans normally tend to include fewer longer runs ("TTTTT" and so on) if they try to be random, but micromass (or whoever made that problem) knows about this and can manipulate the sequence.

I would expect sequence 1 to be made by a human knowing about typical biases humans have when trying to generate random sequences. The main point is the high probability to get two identical tosses in a row, and the low number of sequences of alternating coin tosses (again, correlated of course, didn't run toy studies to make a probability out of that).

micromass said:
Take the following two sequences of coin tosses:

One of these sequences is from an actual coin toss experiment. The other is invented by a human. Find out which of these is which.
Well, after close examination, I have determined that the real coin toss was the second one. The probability string of 6 or more of the same result is less than or equal to 1/64. Still possible. Having a string of 6 or more of the same result 4 times has the probability of 1/1024-- the first one had more than 4 strings of 6 or more, the second had none of them. I thus say that the first is fake and the second is real.

Oh, mfb beat me to it. I didn't even read any of the other responses before I posted.

micromass said:
Given the following encoded text, find out whether this is a real text or randomly generated using some scheme. Attempting to decode the text doesn't count.
Man-made (real text). Definitely. A few of the most common letters are y, u, i, h and j, letters that are all next to each other. Of course, everyone thinks to put in a lot of spaces and this text has plenty on them. What letter doesn't appear at all? The letter nobody thinks about and that is at the bottom left corner of the keyboard. z. The probability of z not appearing at all is infinitesimally small.

Isaac0427 said:
Man-made (real text). Definitely. A few of the most common letters are y, u, i, h and j, letters that are all next to each other. Of course, everyone thinks to put in a lot of spaces and this text has plenty on them. What letter doesn't appear at all? The letter nobody thinks about and that is at the bottom left corner of the keyboard. z. The probability of z not appearing at all is infinitesimally small.

The text doesn't have to be generated from a uniform distribution, it could have been generated some other way. That said, I agree that it was probably man-made. The character distribution roughly matches the letter frequencies of the English language, so I assume that it was generated using some kind of substitution cipher (although, amusingly, he seems to have replaced the e's with spaces).

Isaac0427 said:
Well, after close examination, I have determined that the real coin toss was the second one. The probability string of 6 or more of the same result is less than or equal to 1/64. Still possible. Having a string of 6 or more of the same result 4 times has the probability of 1/1024-- the first one had more than 4 strings of 6 or more, the second had none of them. I thus say that the first is fake and the second is real.
I don't know where the 1/1024 comes from, but it is not right. The 1/64 applies to a specific position only.

micromass said:
That's for you to decide on. I left the question very open for this purpose. All I'm giving is the number of people on the train on several days, the question is how you would decide how many seats the train should have. Clearly we don't want it be too long since that's a waste of money. But we don't want it to be too short either, otherwise we get complaints about people not finding seats.

I know some train companies in England who would take all the seats out (if they were allowed to), have everyone stand throughout the journey in order to mininise the number of carriages and call that optimal. Then double the season-ticket prices (if they were allowed to)!

micromass
A professor got a ticket twelve times for illegal overnight parking. All twelve tickets were given either Tuesdays or Thursdays. Is it justified for him to rent a garage on these days?
Insufficient information. How much does renting a garage cost, how much do the tickets cost, averaged over all the times he parked there on Tuesdays and Thursdays?
Does he park there on other days as well? If he parks there on all work days with equal frequency and gets tickets with the same probability, the probability that all parking tickets are limited to two weekdays is just 0.00017. While there is no mathematical proof possible I would expect that those days are indeed more "dangerous" than the others. It is not unreasonable to have more checks on two specific days.

QuantumQuest
Here are my thoughts on number 10:

How did the psychic get his infallibility rating? It seems fairly obvious to take both boxes. If the psychic always predicts that the player will take both boxes, then in 99.9% of cases this is what happens. Only 1 in a thousand decides simply to take box A.

I reckon box B has $1M in it. I knew combined box problems like 10 already*, so here is an unconventional approach as a psychic is involved: I would try to find someone convinced in the psychic's ability for a$1000 to $10000 bet that the psychic is wrong, and take box B only. It is a win/win situation: Psychic put money in? I gain 990,000. Psychic didn't put money in? I gain$1000 (same as I would with taking two boxes), and collect additional evidence that psychics do not exist.

* ;)

#5
micromass said:
I have a big box filled with balls. All balls have a number. I draw 55 balls at random and record their number. They are: 1010, 5050, 104104, 130130, 213213. How many balls do you expect to be in the box?

At the beginning of the experiment, there were at least five balls. Any larger estimates require assumptions about the manner in which the balls were numbered.

3 Americans are on a train through Scotland: a statastician, a physicist and a mathematician. They all see a brown cow out the window.

The statastician says "Oh, cows in Scotland must be brown!"

The physicist says "Well, we know there's a brown cow in Scotland."

The mathematician says "Not quite! We know there is at least one cow in Scotland, and at least half of it is brown!"

mfb and Ygggdrasil
mfb, Ygggdrasil and thephystudent
I guess mine was too bad, or too hidden :(.

The numbers remind me of a story I saw a while ago (not sure if it really happened): Some students released three pigs at a university, and had them labeled as "1", "2" and "4". The search for pig "3" took quite some time!Problem 4 needs some love. I'll assume that a detection has the same probability for all x between 1 and 20, otherwise we have insufficient information to start working on it. As we do not know the total number of particles, the distribution in the interval is the only thing we can use. We expect an exponential distribution, this is invariant under shifts, so for simplicity subtract 1 from all measured values and the range, we measure from 0 to 19. Unfortunately, within the small experimental sample, the best fit is a flat distribution. Looking at the data, this is not surprising as we have 4 out of 6 decays in the second half. We cannot set an upper limit on λ. For each event, we can calculate a likelihood:
$$L(x)=\frac {e^{-x/\lambda}} { \lambda \left(1-e^{-19/\lambda}\right) }$$
Calculate the product of all 6 events for a total likelihood, and take the negative logarithm of it for a nice scaling.

In images, first the distribution for this dataset, then the distribution for a "more normal" dataset, with more events for small x:

Given data, red line at the limit for λ->infinity.

Example data with a more usual distribution, red line at the limit for λ->infinity again.

A particle physicist would now probably look for the range where the negative log likelihood is not larger than ##\chi_2^{-1}(0.05)/2=1.92## above the minimum, which leads to ##\lambda > 4.0## at 95% confidence level.

Problem 1) The answer is obviously 0...

- Anyone traveling from Rohan to Mordor will be wanting to take their horse as well. Therefore much more room is needed on each carriage, and horses will make quite a mess on such a journey.

- Anyone/anything traveling from Mordor to Rohan is either an orc or a troll, or something even nastier, so they can bloody-well stand. (There won't be any Rohirrim returning from Mordor to Rohan, since they and their horses will have been eaten by the trolls. That's why the trolls want to go to Rohan -- for 2nd helpings.)

But on 2nd thoughts, the answer is: Go out and shoot whoever had the bright idea of building a Rohan--Mordor railway in the first place!

micromass
micromass said:
That's for you to decide on. I left the question very open for this purpose. All I'm giving is the number of people on the train on several days, the question is how you would decide how many seats the train should have. Clearly we don't want it be too long since that's a waste of money. But we don't want it to be too short either, otherwise we get complaints about people not finding seats.

But depending on how we value "cost incurred by passenger in not finding a seat" vs. "cost incurred by company in making a seat available that is unused" the answers would differ. If you do not make those values clear upfront, how can an answer be found?

DocZaius said:
But depending on how we value "cost incurred by passenger in not finding a seat" vs. "cost incurred by company in making a seat available that is unused" the answers would differ. If you do not make those values clear upfront, how can an answer be found?

You assume the answer is unique. It is not. You can make all assumptions you want, just clearly state them and try to be somewhat realistic.

The answer to 2: the first one is the real random sequence.
The answer to 9: the text is a simple Caesar code of an English text. Up to you to decode it to see what it says. I think fresh_42 already did this.

micromass said:
The answer to 2: the first one is the real random sequence.
The answer to 9: the text is a simple Caesar code of an English text. Up to you to decode it to see what it says. I think fresh_42 already did this.

And now I'm struggling with an overflow in RSA ...

ProfuselyQuarky
micromass said:
The answer to 2: the first one is the real random sequence.
An odd one, however. So my initial impression was right, and then I overthought the problem.

mfb said:
An odd one, however. So my initial impression was right, and then I overthought the problem.
Yep. One more brick in my wall of discomfort when I hear someone, esp. politicians, argue with statistics.

micromass
This one (#2) surprised me too, but the probability of either sequence coming up (again) exactly the same in a random flipping of 200 coins is ## p=(1/2)^{200} ##.

This one (#2) surprised me too, but the probability of either sequence coming up (again) exactly the same in a random flipping of 200 coins is ## p=(1/2)^{200} ##.
Well.. yes, but that's not what the question was about.

mfb said:
Well.. yes, but that's not what the question was about.
It would appear when micromass selected a "random" coin flip sequence, the unlikely occurred. A couple of simple tests showed it wound up ## 2 \sigma ## or more from the mean. Not too unlikely, but not the most common case.

I agree that the answer to #2 is a bit odd. The standard way to test for randomness in a dataset (or at least to test whether events are independent) is the Wald-Wolfowitz run test. Dataset 1 contains 108 heads, & 91 tails, so one would expect to see 99.8 runs with a standard deviation of 6.98. The actual number of runs observed was 82, which is 2.5σ below expected, yielding a p-value ~ 0.013

Dataset 2 contains 105 heads and 94 tails, so one would expect to see 100.2 runs with a standard devation of 7.01. The actual number of observed runs is 104, just 0.5σ away from expected, yielding a p-value ~ 0.64.

(Calculations done with the runstest function in MATLAB)

micromass said:
1. Take the following two sequences of coin tosses:

Code:
THHHHTTTTHHHHTHHHHHHHHTTTHHTTHHHHHTTTTTTHHTHHTHHHTTTHTTHHHHTHTTHTTTHHTTTTHHHHHHTTTHHTTHHHTHHHHHTTTTTHTTTHHTTHTTHHTTTHHTTTHHTHHTHHTTTTTHHTHHHHHHTHTHTTHTHTTHHHTTHHTHTHHHHHHHHTTHTTHHHTHHTTHTTTTTTHHHTHHH

Code:
THTHTTTHTTTTHTHTTTHTTHHHTHHTHTHTHTTTTHHTTHHTTHHHTHHHTTHHHTTTHHHTHHHHTTTHTHTHHHHTHTTTHHHTHHTHTTTHTHHHTHHHHTTHTHHTHHHTTTHTHHHTHHTTTHHHTTTTHHHTHTHHHHTHTTHHTTTTHTHTHTTHTHHTTHTTTHTTTTHHHHTHTHHHTTHHHHHTHHH

One of these sequences is from an actual coin toss experiment. The other is invented by a human. Find out which of these is which.

We can assume that if a human invents a binary string, then one or both of the following things may happen:

1) the human does not (or can not) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).

2) the human does not (or can not) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).

Based on the above assumptions I would propose a decision rule based on the result of the two following tests:

Test #1: Let's call α the probability of getting H in a coin toss. Estimate α by calculating the arithmetic average of H's in the string (this is a maximum likelihood estimator) and call this number $\hat{\alpha}$. Perform a likelihood-ratio test between the probability of obtaining the given string under the two hypotheses that $\alpha=\hat{\alpha}$ and $\alpha=\frac{1}{2}$. If the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human. This test can be expressed by checking whether $\hat{\alpha}$ exceeds a certain threshold: $$\hat{\alpha} < \frac{\ln \left( 2(1-\hat{\alpha}) \right)}{\ln(\hat{\alpha}-1)}$$

Test #2: Consider two consecutive observations X,Y. Estimate the joint probability distribution PX,Y by considering all the consecutive pairs of observations in the given string and by filling a 2x2 contingency table of the occurences of the four sequences HH, TH, HT, TT in the given string. Perform a $\chi^2$-test of independence. If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such case, we conclude that the string was generated by a human.The two strings provided in the original post both pass Test #1 but the first string does not pass Test #2: the corresponding p-values were 0.0164 for the first string, and 0.4944 for the second string and the threshold was set to 0.05, thus we conclude that the first string was generated by a human.

As an additional remark I can see a possible connection between this problem and the theory of data compression. Data that is intelligible (or generated) by a human often contains redundancies which are typically due to the correlation of different portions of data. Such correlations allow compression algorithms to predict the successive portions of a stream of data, given the knowledge of the previous data. This is typically not true for random noise, which is why a realization of true random noise is difficult to compress.

Last edited:
mnb96 said:
) the human does not (or can not) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).

2) the human does not (or can not) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).
The random generation doesn't keep track of that by definition, a human must reproduce that in order to generate a plausible random distribution.
mnb96 said:
f the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human.
It will always be higher, unless we happen to have exactly as many T as H, which is unlikely for a randomly generated string.
##\hat \alpha \leq 1##, ##\ln(\hat \alpha -1)## doesn't work.
mnb96 said:
If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such case, we conclude that the string was generated by a human.
A randomly generated string will have correlations, because HH can follow after TH or HH, but not after HT or TT.

mfb said:
The random generation doesn't keep track of that by definition
I never said "by definition".
I said "one of the following things may happen...", implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test (not mine), if he actually does, then he may still not pass Test #2.

mfb said:
It will always be higher, unless we happen to have exactly as many T as H
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.

mfb said:
A randomly generated string will have correlations
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string z1...zN we can already say something about the pair (zN, zN+1). On the other hand, I think we cannot say anything about (zN+1, zN+2). I guess that this just implies that when we populate the 2x2 contingency table we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
Unfortunately, if I do that, then the p-values are well above the threshold for both strings, so the test does not work.

Last edited:
mnb96 said:
I never said "by definition".
I said it.
mnb96 said:
implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test
That is wrong.
The random sequence does not keep track. Why should a human have to keep track?
mnb96 said:
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.
We had simple hypothesis testing already: how likely is it to get a larger deviation from 50% than observed?
The probability to observe 91 or fewer T or H in 199 trials (like sequence 1) is 0.25, the probability to observe 94 or fewer T or H in 199 trials (like sequence 2) is 0.48. Not really a strong preference for one sequence here.

mnb96 said:
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string z1...zN we can already say something about the pair (zN, zN+1). On the other hand, I think we cannot say anything about (zN+1, zN+2)
Well, we know 8 out of 16 things the pair cannot be.
mnb96 said:
we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
We can do that, but that discards a lot of information.

Check the previous pages, there was already a lot of analysis in that direction.