Challenge Micromass' big statistics challenge

The discussion centers on a statistics challenge involving various probability problems, including train passenger estimates, coin toss sequences, fish population estimation, and distinguishing between real and generated text. Participants are tasked with providing not only answers but also detailed strategies and reasoning for their approaches, emphasizing the importance of statistical models. The thread encourages the use of outside sources for reference while prohibiting direct searches for specific answers.
  • #31
Math_QED said:
We now see: ##X \sim B(9, 1/2)##

How do you know this?
 
  • #32
micromass said:
How do you know this?

We only need 9 guesses to know the result. As you said earlier, 9 out of 10 is impossible. Is that correct?
 
  • #33
Math_QED said:
We only need 9 guesses to know the result. As you said earlier, 9 out of 10 is impossible. Is that correct?

I don't see at all how you take into account that I know 5 are Pepsi and 5 are Coca-Cola. In your formulation, it is entirely possible that I say 8 are Pepsi.
 
  • #34
micromass said:
I don't see at all how you take into account that I know 5 are Pepsi and 5 are Coca-Cola. In your formulation, it is entirely possible that I say 8 are Pepsi.

I assumed that the person who has to judge knows there are 5 Pepsis and 5 Colas. Where in my formulation does it say it is possible that there are 8 Pepsis?
 
  • #35
micromass said:
How did you find these formulas?

They all reduce to variations on ##\sum \frac{1}{n(n+1)}##. For example:

##r = 3##

##\sum_{n = k}^{\infty} \frac{F_{n}}{n} = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + \frac{1}{k+2} \frac{k-2}{k} \frac{k-1}{k+1} + \frac{1}{k+3} \frac{k-2}{k} \frac{k-1}{k+1} \frac{k}{k+2} \dots##

## = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + (k-1)(k-2)[\frac{1}{k(k+1)(k+2)} + \frac{1}{(k+1)(k+2)(k+3)} \dots]##

## = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + (k-1)(k-2)[\frac{1}{2k(k+1)}]## (Summed using partial fractions)

## = \frac{1}{2}##
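
For reference, the partial-fraction step spelled out: since ##\frac{1}{n(n+1)(n+2)} = \frac{1}{2}\left[\frac{1}{n(n+1)} - \frac{1}{(n+1)(n+2)}\right]##, the bracketed tail sum telescopes to ##\frac{1}{2k(k+1)}##.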
 
  • #36
Math_QED said:
I assumed that the person who has to judge knows there are 5 Pepsis and 5 Colas.

I don't think you did. In either case, your answer is incorrect.
 
  • #37
micromass said:
I don't think you did. In either case, your answer is incorrect.

Then I'll wait for someone else to solve it. I don't even know whether I was on the right track.
 
  • #38
PeroK said:
They all reduce to variations on ##\sum \frac{1}{n(n+1)}##. For example:

##r = 3##

##\sum_{n = k}^{\infty} \frac{F_{n}}{n} = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + \frac{1}{k+2} \frac{k-2}{k} \frac{k-1}{k+1} + \frac{1}{k+3} \frac{k-2}{k} \frac{k-1}{k+1} \frac{k}{k+2} \dots##

## = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + (k-1)(k-2)[\frac{1}{k(k+1)(k+2)} + \frac{1}{(k+1)(k+2)(k+3)} \dots]##

## = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + (k-1)(k-2)[\frac{1}{2k(k+1)}]## (Summed using partial fractions)

## = \frac{1}{2}##

Is it possible to derive a closed form for the ##F_n##?
 
  • #39
micromass said:
Is it possible to derive a closed form for the ##F_n##?

For ##n \ge k + r - 1##

##F_n = \frac{(k-1)(k-2) \dots (k-r+1)}{(n-r+1)(n-r+2) \dots (n-1)}##

When you sum the ##F_n## you get ##r - 1## initial terms, then a multiple of an infinite sum involving the product of ##r-1## consecutive reciprocals.

There's a correction needed I think. I just did the sum for ##r = 3## and:

##\sum_{n = k}^{\infty} F_{n} = \frac{k-1}{r-2}##

So:

##E(n) = \frac{(k-1)(r-1)}{r-2}##

E.g. ##k = 213, r = 5, E(n) = 282.67##
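
These closed forms are easy to sanity-check numerically. A minimal Matlab sketch (my own, not from the thread), using the closed form for ##F_n## above with the example values ##k = 213, r = 5##:
Code:
% Sketch: check sum(F_n) = (k-1)/(r-2) and E(n) = (r-1)*sum(F_n).
k = 213; r = 5;
n = k:(k + 200000);                % F_n decays like n^-(r-1); the tail is tiny
F = ones(size(n));
for j = 1:(r - 1)                  % F_n = prod_{j=1}^{r-1} (k-j)/(n-j)
    F = F .* (k - j) ./ (n - j);
end
sumF = sum(F)                      % ≈ (k-1)/(r-2) = 70.667
En   = (r - 1) * sumF              % ≈ 282.67, matching E(n) above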
 
  • #40
micromass said:
4. Unstable particles are emitted from a source and decay at a distance ##x##, a real number that has an exponential distribution with characteristic length ##\lambda##. Decay events can be observed only if they occur in a window extending from ##x=1## to ##x=20##. We observe 6 decays at locations ##\{2,5,12,13,13,16\}##. What is ##\lambda##?

Seems like the approach to take here would be maximum likelihood estimation.

Do you want us to work this out by hand, or would it be sufficient to post code? (Or does knowing this much disqualify me from answering)
 
  • #41
Ygggdrasil said:
Seems like the approach to take here would be maximum likelihood estimation.

Do you want us to work this out by hand, or would it be sufficient to post code? (Or does knowing this much disqualify me from answering)

Posting code and the answer would be enough.
 
  • #42
micromass said:
Posting code and the answer would be enough.
Ok.

The strategy to answer #4 is to use maximum likelihood estimation.

In Matlab:
Code:
% Truncated model: exponential pdf renormalized to the observable window [1, 20].
P = @(x,lambda) exppdf(x,lambda)/(expcdf(20,lambda)-expcdf(1,lambda));
data = [2 5 12 13 13 16];
% Maximum likelihood fit, starting from the sample mean, with lambda > 0.
est = mle(data,'pdf',P,'start',mean(data),'lower',0)
Gives an estimate of λ ≈ 90.2.
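
As a cross-check (my own sketch, not part of the original post; it assumes the same Statistics Toolbox functions), one can minimize the negative log-likelihood of the window-truncated model directly with base Matlab's fminbnd:
Code:
% Sketch: direct maximum likelihood for the truncated exponential.
data = [2 5 12 13 13 16];
negLL = @(lambda) -sum(log(exppdf(data,lambda))) ...
    + numel(data)*log(expcdf(20,lambda)-expcdf(1,lambda));
lambdaHat = fminbnd(negLL, 1, 500)   % should land near the mle() estimate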
 
  • #43
For #9, can you confirm whether all 26 letters plus space are allowed in principle in the new language/in the random text generation? And, if random, do you mean 'pulled from a uniform distribution' or 'I just hit my keyboard'? If it is not random, can we assume the original text was in English? Is there a letter-by-letter correspondence, a word-by-word correspondence, or something more delocalized like a 'Fourier transform' or so?
 
  • #44
thephystudent said:
For #9, can you confirm whether all 26 letters plus space are allowed in principle in the new language/in the random text generation?

Yes.

And, if random, do you mean 'pulled from a uniform distribution' or 'I just hit my keyboard'? If it is not random, can we assume the original text was in English?

No, you cannot assume that the "random text" comes from a uniform distribution (which I think is very clearly not the case if you look at the data). And you cannot assume the original text was English.

Is there a letter-by-letter correspondence, a word-by-word correspondence, or something more delocalized like a 'Fourier transform' or so?

I won't give you this information.
 
  • Like
Likes thephystudent
  • #45
micromass said:
No, you cannot assume that the "random text" comes from a uniform distribution (which I think is very clearly not the case if you look at the data). And you cannot assume the original text was English.

The problem is that it seems to me that a number-generating system can always be constructed in a particular way to mislead any test you want to apply. And coding mechanisms can likewise be optimized to make the text look random.

I have a tendency to think the text is real for now, but I don't have sufficient arguments to show it.
 
  • #46
Problem 8

I'll try to tackle this problem in a mostly informal manner. What is of great importance, according to my analysis of the problem, is what is not given or implied. So, in the sentence "take turns at washing dishes", what are these turns? Are they regular or not? Do they favor the youngest girl or not? Then I'll go on to the sentence "There were four breakages". Over what period of time did those breakages take place? Continuing, "Three of them were caused by the youngest girl". How young is she really?

Let's assume first that turns at washing dishes are kept regular and do not favor any of the girls. Then the remaining two questions give some clue. If the breakages took place in a sufficiently small period of time, i.e. in 16 days or some relatively small number of days (I cannot give a specific number here), then, provided "young" is not so young that she cannot properly hold everything that must be washed, it may well be the case that she is clumsy; if she is that young, we simply cannot tell whether she is clumsy, and in fact clumsiness would not have its proper meaning. If, on the other hand, the time period is sufficiently large, then no matter whether she is very young or not, it may well be the case that she is not clumsy at all.

Now suppose turns are not regular, i.e. they favor, for whatever reason, one of the three other girls. Then, for a sufficiently small period of time, if the youngest is very young we can't tell whether she is clumsy, but if she is not, it may well be the case that she does the breakages on purpose or is clumsy (with odds favoring the former). For a sufficiently large period of time, no matter whether the youngest girl is very young or not, she is not clumsy at all.

Finally, if turns at washing dishes favor the youngest girl, then for a sufficiently small period of time, if she is too young we can't tell whether she is clumsy or not, but if she is not, it turns out that she is probably clumsy. For a sufficiently large period of time, no matter whether she is too young or not, she is rather not clumsy.

I know I just talked in the abstract, but I cannot think of a way to quantify the odds properly for this problem. A rough sketch of a tree diagram could be constructed, but because the odds in the branching process cannot be defined in a precise numerical manner (e.g., what constitutes a sufficiently small/long period of time, or where the dividing line between too young and not lies, to name a few), I did this rough and abstract analysis. So, in conclusion, I cannot see a way of determining whether the youngest girl is clumsy or not from the data given.

EDIT: in all the above I assume that the youngest girl has no health problem and that there is no external factor of any kind that could have an influence.
 
Last edited:
  • Like
Likes micromass
  • #47
For question 5:
PeroK said:
##E(n) = \frac{(k-1)(r-1)}{r-2}##

I've managed to prove the general formulas for this now. If anyone is interested:

First, for partial fractions:

##\frac{1}{n(n+1) \dots (n+r)} = \frac{1}{r!}[\frac{1}{n} - \binom{r}{1} \frac{1}{n+1} + \binom{r}{2} \frac{1}{n+2} \dots \pm \frac{1}{n+r}]##

And using this, you can show that:

##\sum_{n = k}^{\infty} \frac{1}{n(n+1) \dots (n+r)} = \frac{1}{r}[\frac{1}{k(k+1) \dots (k+r-1)}]##

With ##F_k = 1## and ##F_n = \frac{n-r}{n-1}F_{n-1} \ \ (n > k)## we have:

##\sum_{n = k}^{\infty} F_{n} = \frac{k-1}{r-2}##

##\sum_{n = k}^{\infty} \frac{F_{n}}{n} = \frac{1}{r-1}##

The general proof follows the same iterative pattern as for ##r = 3##:

##\sum_{n = k}^{\infty} F_{n} = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3})##
##+ (k-1)(k-2) \dots (k-r+1) \sum_{n = k}^{\infty} \frac{1}{n(n+1) \dots (n+r-2)}##

## = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3}) + (k-1)(k-2) \dots (k-r+1) (\frac{1}{r-2})(\frac{1}{k(k+1) \dots (k+r-3)})##

## = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3})[1 + \frac{k-1}{r-2}]##

## = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3})[\frac{k + r-3}{r-2}]##

The term ##(k + r -3)## cancels and the process iterates backwards until:

##\sum_{n = k}^{\infty} F_{n} = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{r-2}) = 1 + (\frac{k-r+1}{k})[1 + \frac{k-r+2}{r-2}] = 1 + (\frac{k-r+1}{k})[\frac{k}{r-2}] = 1 + (\frac{k-r+1}{r-2}) = \frac{k-1}{r-2}##

And similarly, you can show that ##\sum_{n = k}^{\infty} \frac{F_{n}}{n} = \frac{1}{r-1}##
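
The tail-sum lemma above is also easy to verify numerically. A small Matlab sketch (my own, not from the thread, with arbitrary test values ##r = 4, k = 10##):
Code:
% Sketch: check sum_{n>=k} 1/(n(n+1)...(n+r)) = 1/(r*k(k+1)...(k+r-1)).
r = 4; k = 10;
n = k:(k + 100000);
terms = ones(size(n));
for j = 0:r
    terms = terms ./ (n + j);      % builds 1/(n(n+1)...(n+r)) elementwise
end
lhs = sum(terms)                   % ≈ 1.4569e-05
rhs = 1/(r * prod(k:(k + r - 1)))  % = 1/(4*10*11*12*13), the same value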
 
  • #48
micromass said:
That's an interesting analysis. I'll be waiting for more input before I spill the beans on this one.
Another try on 9).

Now that I've written a little program to play with strings, I've seen two things:
  1. The distribution of word lengths is pretty close to the expected one for the English language, at least below a word length of 10. Since English consists of many small words compared to other languages, this is strong evidence for an English text, although the total number of available words (220 given, 233 in a control text of equal length) isn't very big.
  2. The relative frequencies of some letters differ extraordinarily from the expected values! This almost certainly excludes the possibility of a CAESAR code or any other bijection of the alphabet. (E.g., 28 occurrences of "e" as the most frequent letter are expected, yet there are 134 occurrences of "u", 109 of "y", and 102 of "j".)
Conclusion: either it is a random text pretending to be English by its word lengths, or it is a code that encrypts words instead of letters.

In favor of the latter is the fact that some words, even long ones, recur in a strange manner. Such a code is also easier to produce with the search-and-replace methods of today's editors (which would fail for substituting single letters).
I'll have to take a closer look at that property.

Meanwhile I have changed my mind relative to my earlier quick zipping analysis.

The text in 9) is English and is coded not by a bijection of the alphabet, but by substitution of whole words or otherwise grouped letters.
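
For reference, a minimal Matlab sketch of the kind of counting described above (my own; the file name mystery.txt is a hypothetical stand-in for the problem text):
Code:
% Sketch: letter frequencies and word-length distribution of a text file.
txt = lower(fileread('mystery.txt'));              % hypothetical file name
letters = txt(txt >= 'a' & txt <= 'z');
freq = arrayfun(@(c) sum(letters == c), 'a':'z');  % counts per letter a..z
words = strsplit(strtrim(txt));                    % split on whitespace only
wordLens = cellfun(@length, words);                % (punctuation not stripped)
histcounts(wordLens, 0.5:1:15.5)                   % tallies of lengths 1..15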
 
  • Like
Likes thephystudent
  • #49
I would like to respond to #2. I did a simple statistical test. There are many tests that could be used, but I decided to look at the first head of each run (a head immediately preceded by a tail) and check the number of times it is followed by another head. Assign a "1" to a head and a "0" to a tail. In the upper row I counted 17 such first heads, and 14 of them were followed by a head. For a binomial distribution, ## \mu=Np ## and ## \sigma^2=Npq ##. With N=17 and p=q=1/2, the mean is 8.5 and ## \sigma=2.1 ##, so 14 is about 2.6 ## \sigma ## above the mean: ## z=(14-8.5)/2.1 \approx 2.6 ##. For the second row, with the same test, I counted N=29 such occurrences and 13 subsequent heads. In this case the mean is 14.5 and ## \sigma=2.7 ##, and the sample differed from the mean by much less than 1 ## \sigma ##. Clearly the top row is the one invented by a human and the bottom row is from a coin-toss experiment. (Note: I would like to double-check my counting; it was a little hard to read, but I think my tallies of 17 and 14, and 29 and 13, were nearly correct.) I see @PeroK made a similar "qualitative" observation in post 11.
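
A sketch of this run-start test in Matlab (my own, not the poster's code; s is a hypothetical stand-in for one of the two 0/1 rows):
Code:
% Sketch: count heads that start a run and how many are followed by a head.
s = randi([0 1], 1, 100);                % stand-in for an actual toss row
starts = find(s(2:end-1) == 1 & s(1:end-2) == 0) + 1;  % head preceded by tail
N  = numel(starts);                      % (a head at position 1 is ignored)
k  = sum(s(starts + 1) == 1);            % of those, how many precede a head
mu = N/2; sigma = sqrt(N)/2;             % binomial mean and s.d., p = q = 1/2
z  = (k - mu)/sigma                      % |z| near 3 suggests a man-made row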
 
  • #50
Charles Link said:
I would like to respond to #2. I did a simple statistical test. ... Clearly the top row is the one invented by a human and the bottom row is from a coin-toss experiment.

Yes, in the first sequence there seems to be a correlation between the outcome of each toss and the previous one (it's more likely to be the same). For once I was more interested in the psychology than the maths. The normal tendency, I believe, is to do the opposite: try to even things up by favouring a Head after a Tail and vice versa. If sequence 1 is human generated, then whoever did it appears to have known this and overcompensated by favouring another Head after a Head and another Tail after a Tail. Or, perhaps, it was @micromass doing this deliberately!
 
  • Like
Likes Charles Link
  • #51
fresh_42 said:
Another try on 9)...
Correction: I made a mistake and used a wrong denominator in one line (#words instead of #letters), so the picture concerning the occurrences of letters is a different one. However, my next step would be attempts to decode the text, now that I have some clues for setting up a bijection of the alphabet. Since this is forbidden, my last hypothesis is:
9) is an English text.
And now I'll try to decode it.
 
  • #52
PeroK said:
Yes, in the first sequence there seems to be a correlation between the outcome of each toss and the previous one (it's more likely to be the same). ...
Hopefully he gives me credit for it already, but if not, a similar test could be done observing the two coin flips after a head first appears. In that case, a simple observation is that there are often two heads following it. I didn't compute it numerically yet, but it looks as if that might give about 4 ## \sigma ## or 5 ## \sigma ## from the mean for the top sequence, and even 3 ## \sigma ## is very unlikely to happen by chance. ...editing... this second test of the two subsequent coin tosses (if I computed it correctly) gave just over ## 2 \sigma ## from the mean for the first set (and not ## 4 \sigma ## or ## 5 \sigma ##), but it is still an indicator that the sequence was likely man-made.
 
Last edited:
  • #53
Charles Link said:
Hopefully he gives me credit for it already, but if not, a similar test could be done observing the two coin flips after a head first appears. ...

This is also the result I seemed to get when observing Shannon entropy, which is a (fancy but strongly physics-related) measure of randomness and is thus expected to be higher for the real sequence. I seem to have made a mistake last time, because now I also observe the THT-sequence to have the higher entropy for virtually any combination.

I don't know how to make a table here on Physics Forums, but the entropies (THH-sequence / THT-sequence) are:
- pairs {(n, n+1)} of two subsequent flips: 1.365 / 1.382
- pairs {(n, n+2)}: 1.376 / 1.383
- pairs {(n, n+30)}: 1.368 / 1.382
- pairs {(n, n+100)}: 1.364 / 1.383
- triples {(n, n+1, n+2)}: 2.032 / 2.072

In conclusion, the THT-sequence is real and the THH-sequence is fake. A disadvantage of my approach is that, unlike the approaches of others using the more canonical statistical methods, I cannot give an uncertainty/probability that the random sequence accidentally looked more man-made.
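
A minimal sketch of this entropy computation (my own, not the poster's code). Note the posted values are consistent with natural-log entropy, whose i.i.d. maxima are ##\ln 4 \approx 1.386## for pairs and ##\ln 8 \approx 2.079## for triples:
Code:
% Sketch: Shannon entropy (in nats) of the empirical pair distribution.
s = randi([0 1], 1, 100);            % stand-in for one of the toss sequences
d = 1;                               % lag: pairs (s_n, s_(n+d))
pairs = 2*s(1:end-d) + s(1+d:end);   % encode each pair as a symbol 0..3
p = histcounts(pairs, -0.5:1:3.5);   % counts of the four pair types
p = p(p > 0) / sum(p);
H = -sum(p .* log(p))                % i.i.d. fair tosses give up to ln 4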
 
  • Like
Likes Charles Link
  • #54
I'll let you guys continue to think of this a bit more. I'll give the answer in a few days time.
 
  • #55
@micromass Thank you. I now have a nice little mini Enigma for Caesars, random letter encryption, word and letter counting, decoding attempts ... guess I'll have to add the RSA feature to finish it.
 
  • #56
fresh_42 said:
@micromass Thank you. I now have a nice little mini Enigma for Caesars, random letter encryption, word and letter counting, decoding attempts ... guess I'll have to add the RSA feature to finish it.

I might make a cryptography challenge later :smile:
 
  • #57
micromass said:
I might make a cryptography challenge later :smile:
I don't know whether this is good or bad news for my little playground. As far as I know you, the chances are high that you will put some noise on the channel, and I'm not sure whether I want to deal with error-correcting mechanisms. :nb)
 
  • #58
fresh_42 said:
I'm not sure whether I want to deal with error-correcting mechanisms. :nb)

Of course you do!
 
  • #59
micromass said:
I'll let you guys continue to think of this a bit more. I'll give the answer in a few days time.
I have a more definitive statistical test for #2. The item that separates the first sequence from the second is the number of changes of state (from the previous flip). In 100 coin tosses, there are ## N=99 ## possible changes of state. It's a simple binomial test with ## \mu=Np ## and ## \sigma^2=Npq ## (with ## p=q=1/2 ##). This gives ## \mu=49.5 ## and ## \sigma=5.0 ##. For the first sequence I counted k=33 changes of state, which gives ## z=(49.5-33)/5.0 \approx 3.3 ##, i.e. more than ## 3 \sigma ## from the mean. For the second sequence I still need to tally it, but I want to post this before someone else beats me to it. ...editing... for the second sequence I counted k=50 (approximately) transitions (my eyes aren't real steady, so the count may be off by one or two). This second case is well within ## 1 \sigma ## of the mean and is the kind of result to be expected from a random coin toss.
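
A sketch of this change-of-state test in Matlab (my own, not the poster's code; s stands in for a 0/1 toss row):
Code:
% Sketch: the transition-count test on a row of 100 tosses.
s = randi([0 1], 1, 100);            % stand-in for an actual toss row
k = sum(diff(s) ~= 0);               % changes of state among the 99 pairs
N = 99; p = 0.5;
mu = N*p; sigma = sqrt(N*p*(1-p));   % mu = 49.5, sigma ≈ 5.0
z = (mu - k)/sigma                   % z > 3: too few changes, likely man-made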
 
Last edited:
  • #60
Problem 1: I don't get it. In what sense "optimal"?