Micromass' big statistics challenge


Discussion Overview

The discussion revolves around various statistical problems and challenges, focusing on probability theory and statistical reasoning. Participants are tasked with developing strategies to solve open-ended questions related to statistics, including coin toss sequences, estimating populations, and analyzing experimental data.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests finding the optimal number of seats on a train based on daily passenger counts, emphasizing the need for a detailed strategy and model justification.
  • Another participant presents two sequences of coin tosses, proposing to determine which is from an actual experiment and which is fabricated, using statistical tests.
  • A different participant describes a method for estimating the fish population in a lake using a catch-and-recapture technique, referencing the proportion of marked fish recaptured.
  • One participant discusses the decay of unstable particles and the estimation of a characteristic length based on observed decay events, raising questions about the underlying distribution.
  • Another participant presents a scenario involving a psychic predicting a player's actions in a game, prompting discussion on decision-making under uncertainty and the implications of the psychic's predictions.
  • Several participants reference various statistical tests and methods, including the chi-squared test, to analyze the randomness of sequences and outcomes.

Areas of Agreement / Disagreement

Participants express a range of viewpoints on the statistical problems presented, with no clear consensus reached on the best strategies or interpretations of the results. Multiple competing models and approaches are discussed, reflecting the complexity of the statistical challenges.

Contextual Notes

Some participants note limitations in their analyses, such as assumptions about randomness and the dependence on specific statistical models. The discussion highlights the need for careful consideration of the underlying assumptions in statistical reasoning.

Who May Find This Useful

This discussion may be of interest to those studying statistics, probability theory, or related fields, as well as practitioners looking for diverse approaches to statistical challenges.

  • #91
The answer to 2: the first one is the real random sequence.
The answer to 9: the text is a simple Caesar cipher of an English text. Up to you to decode it and see what it says. I think fresh_42 already did this.
 
  • #92
micromass said:
The answer to 2: the first one is the real random sequence.
The answer to 9: the text is a simple Caesar cipher of an English text. Up to you to decode it and see what it says. I think fresh_42 already did this.


And now I'm struggling with an overflow in RSA ...
 
  • #93
micromass said:
The answer to 2: the first one is the real random sequence.
An odd one, however. So my initial impression was right, and then I overthought the problem.
 
  • #94
mfb said:
An odd one, however. So my initial impression was right, and then I overthought the problem.
Yep. One more brick in my wall of discomfort when I hear someone, esp. politicians, argue with statistics.
 
  • #95
This one (#2) surprised me too, but the probability of either sequence coming up (again) exactly the same in a random flipping of 200 coins is ## p=(1/2)^{200} ##.
 
  • #96
Charles Link said:
This one (#2) surprised me too, but the probability of either sequence coming up (again) exactly the same in a random flipping of 200 coins is ## p=(1/2)^{200} ##.
Well.. yes, but that's not what the question was about.
 
  • #97
mfb said:
Well.. yes, but that's not what the question was about.
It would appear that when micromass selected a "random" coin flip sequence, the unlikely occurred. A couple of simple tests showed it wound up ## 2 \sigma ## or more from the mean. Not too unlikely, but not the most common case.
 
  • #98
I agree that the answer to #2 is a bit odd. The standard way to test for randomness in a dataset (or at least to test whether successive events are independent) is the Wald–Wolfowitz runs test. Dataset 1 contains 108 heads and 91 tails, so one would expect to see 99.8 runs with a standard deviation of 6.98. The actual number of runs observed was 82, which is 2.5σ below expected, yielding a p-value of ~0.013.

Dataset 2 contains 105 heads and 94 tails, so one would expect to see 100.2 runs with a standard deviation of 7.01. The actual number of observed runs is 104, just 0.5σ away from expected, yielding a p-value of ~0.64.

(Calculations done with the runstest function in MATLAB)
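For anyone without MATLAB, here is a rough Python equivalent using the usual normal approximation (my quick translation, not the exact runstest internals, so the p-values may differ slightly in the tails):
Code:
# Wald-Wolfowitz runs test via the normal approximation.
# Reproduces the expected-runs and standard-deviation figures quoted above.
from math import sqrt, erfc

def runs_test(seq):
    n1 = seq.count('H')  # number of heads
    n2 = seq.count('T')  # number of tails
    n = n1 + n2
    # A run starts at the first toss and at every change of symbol.
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    mu = 2.0 * n1 * n2 / n + 1                      # expected number of runs
    var = 2.0 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mu) / sqrt(var)
    p = erfc(abs(z) / sqrt(2))                      # two-sided normal p-value
    return runs, mu, sqrt(var), p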
 
  • #99
micromass said:
  1. Take the following two sequences of coin tosses:

    Code:
    THHHHTTTTHHHHTHHHHHHHHTTTHHTTHHHHHTTTTTTHHTHHTHHHTTTHTTHHHHTHTTHTTTHHTTTTHHHHHHTTTHHTTHHHTHHHHHTTTTTHTTTHHTTHTTHHTTTHHTTTHHTHHTHHTTTTTHHTHHHHHHTHTHTTHTHTTHHHTTHHTHTHHHHHHHHTTHTTHHHTHHTTHTTTTTTHHHTHHH

    Code:
    THTHTTTHTTTTHTHTTTHTTHHHTHHTHTHTHTTTTHHTTHHTTHHHTHHHTTHHHTTTHHHTHHHHTTTHTHTHHHHTHTTTHHHTHHTHTTTHTHHHTHHHHTTHTHHTHHHTTTHTHHHTHHTTTHHHTTTTHHHTHTHHHHTHTTHHTTTTHTHTHTTHTHHTTHTTTHTTTTHHHHTHTHHHTTHHHHHTHHH

    One of these sequences is from an actual coin toss experiment. The other is invented by a human. Find out which of these is which.

What about the following solution:
We can assume that if a human invents a binary string, then one or both of the following things may happen:

1) the human does not (or cannot) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).

2) the human does not (or cannot) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).

Based on the above assumptions, I would propose a decision rule based on the results of the following two tests:

Test #1: Let's call ##\alpha## the probability of getting H in a coin toss. Estimate ##\alpha## by calculating the arithmetic average of H's in the string (this is a maximum likelihood estimator) and call this number ##\hat{\alpha}##. Perform a likelihood-ratio test between the probability of obtaining the given string under the two hypotheses that ##\alpha=\hat{\alpha}## and ##\alpha=\frac{1}{2}##. If the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human. This test can be expressed by checking whether ##\hat{\alpha}## exceeds a certain threshold: ##\hat{\alpha} < \frac{\ln \left( 2(1-\hat{\alpha}) \right)}{\ln(\hat{\alpha}-1)}##
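For concreteness, a minimal Python sketch of the likelihood comparison in Test #1 (my own quick code, not necessarily the exact computation behind the numbers below):
Code:
# Compare the log-likelihood of the string under alpha = alpha_hat (the MLE)
# with the log-likelihood under alpha = 1/2 (a fair coin).
from math import log

def test1(seq):
    n = len(seq)
    h = seq.count('H')
    alpha_hat = h / n  # MLE of P(heads); assumes 0 < h < n
    ll_mle = h * log(alpha_hat) + (n - h) * log(1 - alpha_hat)
    ll_fair = n * log(0.5)
    return alpha_hat, ll_mle, ll_fair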

Test #2: Consider two consecutive observations X, Y. Estimate the joint probability distribution ##P_{X,Y}## by considering all the consecutive pairs of observations in the given string and by filling a 2x2 contingency table of the occurrences of the four sequences HH, TH, HT, TT in the given string. Perform a ##\chi^2##-test of independence. If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such a case, we conclude that the string was generated by a human.

The two strings provided in the original post both pass Test #1, but the first string does not pass Test #2: the corresponding p-values were 0.0164 for the first string and 0.4944 for the second string, with the threshold set to 0.05. We thus conclude that the first string was generated by a human.
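A minimal Python sketch of Test #2 as well (assuming scipy's chi2_contingency; again my own quick code, so the p-values above may not be bit-for-bit reproducible):
Code:
# Tabulate all overlapping consecutive pairs into a 2x2 contingency table
# and run a chi-squared test of independence.
from scipy.stats import chi2_contingency

def test2(seq):
    counts = {('H', 'H'): 0, ('H', 'T'): 0, ('T', 'H'): 0, ('T', 'T'): 0}
    for a, b in zip(seq, seq[1:]):
        counts[(a, b)] += 1
    table = [[counts[('H', 'H')], counts[('H', 'T')]],
             [counts[('T', 'H')], counts[('T', 'T')]]]
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2, p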

As an additional remark I can see a possible connection between this problem and the theory of data compression. Data that is intelligible (or generated) by a human often contains redundancies which are typically due to the correlation of different portions of data. Such correlations allow compression algorithms to predict the successive portions of a stream of data, given the knowledge of the previous data. This is typically not true for random noise, which is why a realization of true random noise is difficult to compress.
 
  • #100
mnb96 said:
1) the human does not (or cannot) keep track of how many H's and T's he has previously generated, so he/she ends up creating a string where the observations are highly biased towards H (or T).

2) the human does not (or cannot) remember the whole sequence of H's and T's he/she has generated so far, so he/she tends to generate new observations based on the (few) previous observation(s).
The random generation doesn't keep track of that by definition; a human must reproduce that in order to generate a plausible random sequence.
mnb96 said:
If the likelihood of the former hypothesis is higher, then conclude that the string was generated by a human.
It will always be higher, unless we happen to have exactly as many T as H, which is unlikely for a randomly generated string.
Also, since ##\hat \alpha \leq 1##, the expression ##\ln(\hat \alpha -1)## doesn't work: its argument is never positive.
mnb96 said:
If the hypothesis of independence is rejected, then we can deduce that there was probably a correlation between consecutive observations. In such case, we conclude that the string was generated by a human.
A randomly generated string will have correlations between overlapping pairs, because HH can follow after TH or HH, but not after HT or TT.
 
  • #101
mfb said:
The random generation doesn't keep track of that by definition
I never said "by definition".
I said "one or both of the following things may happen...", implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test (not mine); if he actually does, then he may still not pass Test #2.

mfb said:
It will always be higher, unless we happen to have exactly as many T as H
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.

mfb said:
A randomly generated string will have correlations
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string ##z_1 \dots z_N##, we can already say something about the pair ##(z_N, z_{N+1})##. On the other hand, I think we cannot say anything about ##(z_{N+1}, z_{N+2})##. I guess that this just implies that when we populate the 2x2 contingency table we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
Unfortunately, if I do that, then the p-values are well above the threshold for both strings, so the test does not work.
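For concreteness, the only change to my earlier sketch is how the pairs are collected (again just a sketch):
Code:
# Disjoint-pairs variant: count (z1,z2), (z3,z4), ... instead of every
# overlapping pair, which halves the number of entries in the table.
def disjoint_pair_counts(seq):
    counts = {('H', 'H'): 0, ('H', 'T'): 0, ('T', 'H'): 0, ('T', 'T'): 0}
    for a, b in zip(seq[0::2], seq[1::2]):
        counts[(a, b)] += 1
    return counts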
 
  • #102
mnb96 said:
I never said "by definition".
I said it.
mnb96 said:
implying that a human may or may not keep track of the current count of H's: if he does not, he might ideally get caught by a properly designed statistical test
That is wrong.
The random sequence does not keep track. Why should a human have to keep track?
mnb96 said:
This is probably a good point. Perhaps the test I proposed to check whether the string is produced by a biased coin flip experiment is not good (besides I have probably made a mistake in deriving the final formula). I have just noticed that some users were previously working on interesting ideas to achieve the same task. Maybe they will come up with a better solution than mine.
We had simple hypothesis testing already: how likely is it to get a larger deviation from 50% than the one observed?
The probability to observe 91 or fewer T or H in 199 trials (like sequence 1) is 0.25; the probability to observe 94 or fewer T or H in 199 trials (like sequence 2) is 0.48. Not really a strong preference for one sequence here.
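Those numbers come from the two-sided binomial tail; a quick sketch of the computation (assuming scipy, with n = 199 fair-coin trials):
Code:
# Twice the lower binomial tail: P(the rarer face occurs k or fewer times).
from scipy.stats import binom

def two_sided(k, n=199):
    return 2 * binom.cdf(k, n, 0.5)

# two_sided(91) ~ 0.25 (sequence 1), two_sided(94) ~ 0.48 (sequence 2)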

mnb96 said:
I think this statement, in its current form, is incorrect, but I probably understand what you meant in the context: if we have a string ##z_1 \dots z_N##, we can already say something about the pair ##(z_N, z_{N+1})##. On the other hand, I think we cannot say anything about ##(z_{N+1}, z_{N+2})##
Well, we know 8 out of 16 things the pair cannot be.
mnb96 said:
we should not consider "all the consecutive pairs" of the string, as I previously said, but rather all the disjoint consecutive pairs.
We can do that, but that discards a lot of information.

Check the previous pages; there was already a lot of analysis in that direction.