The discussion centers on a statistics challenge involving various probability problems, including train passenger estimates, coin toss sequences, fish population estimation, and distinguishing between real and generated text. Participants are tasked with providing not only answers but also detailed strategies and reasoning for their approaches, emphasizing the importance of statistical models. The thread encourages the use of outside sources for reference while prohibiting direct searches for specific answers. The goal is to foster a collaborative environment for exploring statistical concepts and methodologies. Engaging with these problems enhances understanding of probability theory and its applications in real-world scenarios.
The answer to 2: the first one is the real random sequence.
The answer to 9: the text is a simple Caesar code of an English text. Up to you to decode it to see what it says. I think fresh_42 already did this.
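For anyone who wants to try the decoding themselves: a Caesar cipher has only 25 non-trivial shifts, so brute force is enough. A minimal Python sketch (the ciphertext below is just a placeholder, not the actual text from #9):
```python
def caesar_shift(text, shift):
    """Shift each letter by `shift` positions in the alphabet, leaving other characters alone."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

ciphertext = "WKLV LV MXVW D SODFHKROGHU"  # placeholder, not the challenge text
for shift in range(1, 26):
    print(shift, caesar_shift(ciphertext, shift))
# one of the 25 candidate shifts will read as plain English
```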
And now I'm struggling with an overflow in RSA ...
This one (#2) surprised me too, but the probability of either sequence coming up (again) exactly the same in a random flipping of 200 coins is ## p=(1/2)^{200} ##.
Well.. yes, but that's not what the question was about.
It would appear that when micromass selected a "random" coin flip sequence, the unlikely occurred. A couple of simple tests showed it wound up ## 2 \sigma ## or more from the mean. Not too unlikely, but not the most common case.
I agree that the answer to #2 is a bit odd. The standard way to test for randomness in a dataset (or at least to test whether events are independent) is the Wald-Wolfowitz runs test. Dataset 1 contains 108 heads and 91 tails, so one would expect to see 99.8 runs with a standard deviation of 6.98. The actual number of runs observed was 82, which is 2.5σ below expected, yielding a p-value ~ 0.013.
Dataset 2 contains 105 heads and 94 tails, so one would expect to see 100.2 runs with a standard deviation of 7.01. The actual number of observed runs is 104, just 0.5σ away from expected, yielding a p-value ~ 0.64.
(Calculations done with the runstest function in MATLAB)
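For anyone who wants to check those figures without MATLAB, here is a minimal Python sketch of the runs test using the normal approximation with a continuity correction; it should reproduce roughly the same expected runs, standard deviations and p-values as quoted above:
```python
from math import sqrt, erf

def runs_test(n_heads, n_tails, observed_runs):
    """Wald-Wolfowitz runs test, normal approximation with continuity correction."""
    n = n_heads + n_tails
    expected = 2.0 * n_heads * n_tails / n + 1.0
    variance = (2.0 * n_heads * n_tails * (2.0 * n_heads * n_tails - n)) / (n * n * (n - 1.0))
    z = (abs(observed_runs - expected) - 0.5) / sqrt(variance)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))  # two-sided p-value
    return expected, sqrt(variance), p

print(runs_test(108, 91, 82))   # sequence 1: roughly (99.8, 6.98, 0.013)
print(runs_test(105, 94, 104))  # sequence 2: roughly (100.2, 7.01, 0.64)
```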
One of these sequences is from an actual coin toss experiment. The other is invented by a human. Find out which of these is which.
What about the following solution:
We can assume that if a human invents a binary string, then one or both of the following things may happen:
1) the human does not (or can not) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).
2) the human does not (or can not) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).
Based on the above assumptions I would propose a decision rule based on the result of the two following tests:
Test #1: Let's call ##\alpha## the probability of getting H in a coin toss. Estimate ##\alpha## by calculating the arithmetic average of H's in the string (this is a maximum likelihood estimator) and call this number ##\hat{\alpha}##. Perform a likelihood-ratio test between the probability of obtaining the given string under the two hypotheses ##\alpha=\hat{\alpha}## and ##\alpha=\frac{1}{2}##. If the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human. This test can be expressed by checking whether ##\hat{\alpha}## exceeds a certain threshold: ##\hat{\alpha} < \frac{\ln \left( 2(1-\hat{\alpha}) \right)}{\ln(\hat{\alpha}-1)}## (a rough code sketch of both tests is given after Test #2).
Test #2: Consider two consecutive observations ##X,Y##. Estimate the joint probability distribution ##P_{X,Y}## by considering all the consecutive pairs of observations in the given string and by filling a 2x2 contingency table of the occurrences of the four sequences HH, TH, HT, TT in the given string. Perform a ##\chi^2##-test of independence. If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such case, we conclude that the string was generated by a human.
The two strings provided in the original post both pass Test #1, but the first string does not pass Test #2: the corresponding p-values were 0.0164 for the first string and 0.4944 for the second, with the threshold set to 0.05. We therefore conclude that the first string was generated by a human.
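Roughly, in Python (a sketch, not necessarily the exact code behind the quoted p-values; scipy's chi2_contingency stands in for the ##\chi^2## independence test, and the example string is a placeholder rather than the challenge data):
```python
from collections import Counter
from math import log
from scipy.stats import chi2_contingency

def test1_loglikelihood(seq):
    """Test #1: compare log-likelihoods under alpha = alpha_hat and alpha = 1/2."""
    n, h = len(seq), seq.count('H')       # assumes the string contains both H and T
    alpha_hat = h / n
    ll_hat = h * log(alpha_hat) + (n - h) * log(1 - alpha_hat)
    ll_half = n * log(0.5)
    return ll_hat > ll_half               # True -> "human" under the proposed rule

def test2_independence(seq, threshold=0.05):
    """Test #2: chi-square independence test on consecutive (overlapping) pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    table = [[pairs['HH'], pairs['HT']],
             [pairs['TH'], pairs['TT']]]
    _, p, _, _ = chi2_contingency(table)
    return p < threshold                  # True -> independence rejected -> "human"

seq = 'HTHHTTHT' * 25                     # placeholder, not the actual challenge data
print(test1_loglikelihood(seq), test2_independence(seq))
```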
As an additional remark I can see a possible connection between this problem and the theory of data compression. Data that is intelligible (or generated) by a human often contains redundancies which are typically due to the correlation of different portions of data. Such correlations allow compression algorithms to predict the successive portions of a stream of data, given the knowledge of the previous data. This is typically not true for random noise, which is why a realization of true random noise is difficult to compress.
mnb96 said:
1) the human does not (or can not) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).
2) the human does not (or can not) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).
The random generation doesn't keep track of that by definition; a human must reproduce that in order to generate a plausible random distribution.
mnb96 said:
If the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human.
It will always be higher, unless we happen to have exactly as many T as H, which is unlikely for a randomly generated string.
Since ##\hat \alpha \leq 1##, ##\ln(\hat \alpha -1)## doesn't work: its argument is never positive.
mnb96 said:
If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such case, we conclude that the string was generated by a human.
A randomly generated string will have correlations, because HH can follow after TH or HH, but not after HT or TT.
mfb said:
The random generation doesn't keep track of that by definition
I never said "by definition".
I said "one of the following things may happen...", implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test (not mine), if he actually does, then he may still not pass Test #2.
mfb said:
It will always be higher, unless we happen to have exactly as many T as H
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.
mfb said:
A randomly generated string will have correlations
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string ##z_1 \dots z_N## we can already say something about the pair ##(z_N, z_{N+1})##. On the other hand, I think we cannot say anything about ##(z_{N+1}, z_{N+2})##. I guess that this just implies that when we populate the 2x2 contingency table we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
Unfortunately, if I do that, then the p-values are well above the threshold for both strings, so the test does not work.
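Concretely, the disjoint-pairs variant only changes how the table is filled; a minimal sketch (again using scipy's chi2_contingency, with placeholder data):
```python
from collections import Counter
from scipy.stats import chi2_contingency

def test2_disjoint(seq, threshold=0.05):
    """Chi-square independence test with the table filled from disjoint (non-overlapping) pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(0, len(seq) - 1, 2))
    table = [[pairs['HH'], pairs['HT']],
             [pairs['TH'], pairs['TT']]]
    _, p, _, _ = chi2_contingency(table)
    return p, p < threshold   # (p-value, "reject independence?")

print(test2_disjoint('HTHHTTHT' * 25))  # placeholder data
```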
mnb96 said:
implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test
That is wrong.
The random sequence does not keep track. Why should a human have to keep track?
mnb96 said:
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.
We had simple hypothesis testing already: how likely is it to get a larger deviation from 50% than observed?
The probability of observing 91 or fewer T or H in 199 trials (like sequence 1) is 0.25; the probability of observing 94 or fewer T or H in 199 trials (like sequence 2) is 0.48. Not really a strong preference for one sequence here.
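For reference, those two numbers follow directly from the binomial distribution; a quick sketch (using scipy's binom.cdf and doubling the one-sided tail):
```python
from scipy.stats import binom

def p_as_extreme(k, n=199):
    """Chance of seeing k or fewer of the rarer face in n fair coin flips (two-sided)."""
    return 2 * binom.cdf(k, n, 0.5)

print(p_as_extreme(91))  # sequence 1: about 0.25
print(p_as_extreme(94))  # sequence 2: about 0.48
```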
mnb96 said:
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string ##z_1 \dots z_N## we can already say something about the pair ##(z_N, z_{N+1})##. On the other hand, I think we cannot say anything about ##(z_{N+1}, z_{N+2})##
Well, we know 8 out of 16 things the pair cannot be.
mnb96 said:
we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
We can do that, but that discards a lot of information.
Check the previous pages, there was already a lot of analysis in that direction.