Micromass' big statistics challenge


Discussion Overview

The discussion revolves around various statistical problems and challenges, focusing on probability theory and statistical reasoning. Participants are tasked with developing strategies to solve open-ended questions related to statistics, including coin toss sequences, estimating populations, and analyzing experimental data.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests finding the optimal number of seats on a train based on daily passenger counts, emphasizing the need for a detailed strategy and model justification.
  • Another participant presents two sequences of coin tosses, proposing to determine which is from an actual experiment and which is fabricated, using statistical tests.
  • A different participant describes a method for estimating the fish population in a lake using a catch-and-recapture technique, referencing the proportion of marked fish recaptured.
  • One participant discusses the decay of unstable particles and the estimation of a characteristic length based on observed decay events, raising questions about the underlying distribution.
  • Another participant presents a scenario involving a psychic predicting a player's actions in a game, prompting discussion on decision-making under uncertainty and the implications of the psychic's predictions.
  • Several participants reference various statistical tests and methods, including the chi-squared test, to analyze the randomness of sequences and outcomes.

Areas of Agreement / Disagreement

Participants express a range of viewpoints on the statistical problems presented, with no clear consensus reached on the best strategies or interpretations of the results. Multiple competing models and approaches are discussed, reflecting the complexity of the statistical challenges.

Contextual Notes

Some participants note limitations in their analyses, such as assumptions about randomness and the dependence on specific statistical models. The discussion highlights the need for careful consideration of the underlying assumptions in statistical reasoning.

Who May Find This Useful

This discussion may be of interest to those studying statistics, probability theory, or related fields, as well as practitioners looking for diverse approaches to statistical challenges.

  • #91
The answer to 2: the first one is the real random sequence.
The answer to 9: the text is a simple Caesar cipher of an English text. Up to you to decode it and see what it says. I think fresh_42 already did this.
 
  • #92
micromass said:
The answer to 2: the first one is the real random sequence.
The answer to 9: the text is a simple Caesar cipher of an English text. Up to you to decode it and see what it says. I think fresh_42 already did this.


And now I'm struggling with an overflow in RSA ...
 
  • #93
micromass said:
The answer to 2: the first one is the real random sequence.
An odd one, however. So my initial impression was right, and then I overthought the problem.
 
  • #94
mfb said:
An odd one, however. So my initial impression was right, and then I overthought the problem.
Yep. One more brick in my wall of discomfort when I hear someone, esp. politicians, argue with statistics.
 
  • #95
This one (#2) surprised me too, but the probability of either sequence coming up (again) exactly the same in a random flipping of 200 coins is ## p=(1/2)^{200} ##.
 
  • #96
Charles Link said:
This one (#2) surprised me too, but the probability of either sequence coming up (again) exactly the same in a random flipping of 200 coins is ## p=(1/2)^{200} ##.
Well.. yes, but that's not what the question was about.
 
  • #97
mfb said:
Well.. yes, but that's not what the question was about.
It would appear that when micromass selected a "random" coin flip sequence, the unlikely occurred. A couple of simple tests showed it wound up ## 2 \sigma ## or more from the mean. Not too unlikely, but not the most common case.
 
  • #98
I agree that the answer to #2 is a bit odd. The standard way to test for randomness in a dataset (or at least to test whether successive events are independent) is the Wald–Wolfowitz runs test. Dataset 1 contains 108 heads and 91 tails, so one would expect to see 99.8 runs with a standard deviation of 6.98. The actual number of runs observed was 82, which is 2.5σ below expected, yielding a p-value of ~0.013.

Dataset 2 contains 105 heads and 94 tails, so one would expect to see 100.2 runs with a standard deviation of 7.01. The actual number of observed runs is 104, just 0.5σ away from expected, yielding a p-value of ~0.64.

(Calculations done with the runstest function in MATLAB)
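For anyone without MATLAB, here is a rough Python equivalent using the usual normal approximation (my quick translation, not the exact runstest internals, so the p-values may differ slightly in the tails):
Code:
# Wald-Wolfowitz runs test via the normal approximation.
# Reproduces the expected-runs and standard-deviation figures quoted above.
from math import sqrt, erfc

def runs_test(seq):
    n1 = seq.count('H')  # number of heads
    n2 = seq.count('T')  # number of tails
    n = n1 + n2
    # A run starts at the first toss and at every change of symbol.
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    mu = 2.0 * n1 * n2 / n + 1                      # expected number of runs
    var = 2.0 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mu) / sqrt(var)
    p = erfc(abs(z) / sqrt(2))                      # two-sided normal p-value
    return runs, mu, sqrt(var), p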
 
  • #99
micromass said:
  1. Take the following two sequences of coin tosses:

    Code:
    THHHHTTTTHHHHTHHHHHHHHTTTHHTTHHHHHTTTTTTHHTHHTHHHTTTHTTHHHHTHTTHTTTHHTTTTHHHHHHTTTHHTTHHHTHHHHHTTTTTHTTTHHTTHTTHHTTTHHTTTHHTHHTHHTTTTTHHTHHHHHHTHTHTTHTHTTHHHTTHHTHTHHHHHHHHTTHTTHHHTHHTTHTTTTTTHHHTHHH

    Code:
    THTHTTTHTTTTHTHTTTHTTHHHTHHTHTHTHTTTTHHTTHHTTHHHTHHHTTHHHTTTHHHTHHHHTTTHTHTHHHHTHTTTHHHTHHTHTTTHTHHHTHHHHTTHTHHTHHHTTTHTHHHTHHTTTHHHTTTTHHHTHTHHHHTHTTHHTTTTHTHTHTTHTHHTTHTTTHTTTTHHHHTHTHHHTTHHHHHTHHH

    One of these sequences is from an actual coin toss experiment. The other is invented by a human. Find out which of these is which.

What about the following solution:
We can assume that if a human invents a binary string, then one or both of the following things may happen:

1) the human does not (or cannot) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).

2) the human does not (or cannot) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).

Based on the above assumptions, I would propose a decision rule based on the results of the following two tests:

Test #1: Let's call ##\alpha## the probability of getting H in a coin toss. Estimate ##\alpha## by calculating the arithmetic average of H's in the string (this is a maximum likelihood estimator) and call this number ##\hat{\alpha}##. Perform a likelihood-ratio test between the probability of obtaining the given string under the two hypotheses that ##\alpha=\hat{\alpha}## and ##\alpha=\frac{1}{2}##. If the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human. This test can be expressed by checking whether ##\hat{\alpha}## exceeds a certain threshold: ##\hat{\alpha} < \frac{\ln \left( 2(1-\hat{\alpha}) \right)}{\ln(\hat{\alpha}-1)}##
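For concreteness, a minimal Python sketch of the likelihood comparison in Test #1 (my own quick code, not necessarily the exact computation behind the numbers below):
Code:
# Compare the log-likelihood of the string under alpha = alpha_hat (the MLE)
# with the log-likelihood under alpha = 1/2 (a fair coin).
from math import log

def test1(seq):
    n = len(seq)
    h = seq.count('H')
    alpha_hat = h / n  # MLE of P(heads); assumes 0 < h < n
    ll_mle = h * log(alpha_hat) + (n - h) * log(1 - alpha_hat)
    ll_fair = n * log(0.5)
    return alpha_hat, ll_mle, ll_fair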

Test #2: Consider two consecutive observations X, Y. Estimate the joint probability distribution ##P_{X,Y}## by considering all the consecutive pairs of observations in the given string and by filling a 2x2 contingency table of the occurrences of the four sequences HH, TH, HT, TT in the given string. Perform a ##\chi^2##-test of independence. If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such a case, we conclude that the string was generated by a human.

The two strings provided in the original post both pass Test #1, but the first string does not pass Test #2: the corresponding p-values were 0.0164 for the first string and 0.4944 for the second string, with the threshold set to 0.05. We thus conclude that the first string was generated by a human.
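A minimal Python sketch of Test #2 as well (assuming scipy's chi2_contingency; again my own quick code, so the p-values above may not be bit-for-bit reproducible):
Code:
# Tabulate all overlapping consecutive pairs into a 2x2 contingency table
# and run a chi-squared test of independence.
from scipy.stats import chi2_contingency

def test2(seq):
    counts = {('H', 'H'): 0, ('H', 'T'): 0, ('T', 'H'): 0, ('T', 'T'): 0}
    for a, b in zip(seq, seq[1:]):
        counts[(a, b)] += 1
    table = [[counts[('H', 'H')], counts[('H', 'T')]],
             [counts[('T', 'H')], counts[('T', 'T')]]]
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2, p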

As an additional remark I can see a possible connection between this problem and the theory of data compression. Data that is intelligible (or generated) by a human often contains redundancies which are typically due to the correlation of different portions of data. Such correlations allow compression algorithms to predict the successive portions of a stream of data, given the knowledge of the previous data. This is typically not true for random noise, which is why a realization of true random noise is difficult to compress.
 
  • #100
mnb96 said:
1) the human does not (or cannot) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).

2) the human does not (or cannot) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).
The random generation doesn't keep track of that by definition; a human must reproduce that in order to generate a plausible random sequence.
mnb96 said:
If the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human.
It will always be higher, unless we happen to have exactly as many T as H, which is unlikely for a randomly generated string.
Also, since ##\hat \alpha \leq 1##, the expression ##\ln(\hat \alpha -1)## doesn't work: its argument is never positive.
mnb96 said:
If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such case, we conclude that the string was generated by a human.
A randomly generated string will have correlations between overlapping pairs, because HH can follow after TH or HH, but not after HT or TT.
 
  • #101
mfb said:
The random generation doesn't keep track of that by definition
I never said "by definition".
I said "one or both of the following things may happen...", implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test (not mine); if he actually does, then he may still not pass Test #2.

mfb said:
It will always be higher, unless we happen to have exactly as many T as H
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.

mfb said:
A randomly generated string will have correlations
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string ##z_1 \dots z_N##, we can already say something about the pair ##(z_N, z_{N+1})##. On the other hand, I think we cannot say anything about ##(z_{N+1}, z_{N+2})##. I guess that this just implies that when we populate the 2x2 contingency table we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
Unfortunately, if I do that, then the p-values are well above the threshold for both strings, so the test does not work.
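For concreteness, the only change to my earlier sketch is how the pairs are collected (again just a sketch):
Code:
# Disjoint-pairs variant: count (z1,z2), (z3,z4), ... instead of every
# overlapping pair, which halves the number of entries in the table.
def disjoint_pair_counts(seq):
    counts = {('H', 'H'): 0, ('H', 'T'): 0, ('T', 'H'): 0, ('T', 'T'): 0}
    for a, b in zip(seq[0::2], seq[1::2]):
        counts[(a, b)] += 1
    return counts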
 
  • #102
mnb96 said:
I never said "by definition".
I said it.
mnb96 said:
implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test
That is wrong.
The random sequence does not keep track. Why should a human have to keep track?
mnb96 said:
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.
We had simple hypothesis testing already: how likely is it to get a larger deviation from 50% than the one observed?
The probability to observe 91 or fewer T or H in 199 trials (like sequence 1) is 0.25; the probability to observe 94 or fewer T or H in 199 trials (like sequence 2) is 0.48. Not really a strong preference for one sequence here.
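Those numbers come from the two-sided binomial tail; a quick sketch of the computation (assuming scipy, with n = 199 fair-coin trials):
Code:
# Twice the lower binomial tail: P(the rarer face occurs k or fewer times).
from scipy.stats import binom

def two_sided(k, n=199):
    return 2 * binom.cdf(k, n, 0.5)

# two_sided(91) ~ 0.25 (sequence 1), two_sided(94) ~ 0.48 (sequence 2)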

mnb96 said:
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string ##z_1 \dots z_N##, we can already say something about the pair ##(z_N, z_{N+1})##. On the other hand, I think we cannot say anything about ##(z_{N+1}, z_{N+2})##
Well, we know 8 out of 16 things the pair cannot be.
mnb96 said:
we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
We can do that, but that discards a lot of information.

Check the previous pages; there was already a lot of analysis in that direction.