Micromass' big statistics challenge

In summary, we discussed various open-ended questions related to probability theory and statistics. Our goal was to find strategies and models to answer these questions and to argue for their plausibility. We also allowed the use of outside sources, as long as they were properly referenced. Some of the questions we covered include estimating the number of fish in a lake, distinguishing a real coin-toss experiment from a human-invented one, and determining the optimal number of seats on a train based on daily ridership. We also examined a scenario involving a person playing a game with a psychic, and discussed strategies for distinguishing between randomly generated text and real text.
  • #36
Math_QED said:
I supposed that the person who has to judge knows there are 5 Pepsis and 5 Cokes.

I don't think you did. In either case, your answer is incorrect.
 
  • #37
micromass said:
I don't think you did. In either case, your answer is incorrect.

Then I'll wait for someone else to solve it. I don't even know whether I was on the right track.
 
  • #38
PeroK said:
They all reduce to variations on ##\sum \frac{1}{n(n+1)}##. For example:

##r = 3##

##\sum_{n = k}^{\infty} \frac{F_{n}}{n} = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + \frac{1}{k+2} \frac{k-2}{k} \frac{k-1}{k+1} + \frac{1}{k+3} \frac{k-2}{k} \frac{k-1}{k+1} \frac{k}{k+2} \dots##

## = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + (k-1)(k-2)[\frac{1}{k(k+1)(k+2)} + \frac{1}{(k+1)(k+2)(k+3)} \dots]##

## = \frac{1}{k} + \frac{1}{k+1} \frac{k-2}{k} + (k-1)(k-2)[\frac{1}{2k(k+1)}]## (Summed using partial fractions)

## = \frac{1}{2}##

Is it possible to derive a closed form for the ##F_n##?
 
  • #39
micromass said:
Is it possible to derive a closed form for the ##F_n##?

For ##n \ge k + r - 1##

##F_n = \frac{(k-1)(k-2) \dots (k-r+1)}{(n-r+1)(n-r+2) \dots (n-1)}##

When you sum the ##F_n## you get ##r - 1## initial terms, then a multiple of an infinite sum involving the product of ##r-1## consecutive reciprocals.

There's a correction needed, I think. I just did the sum for ##r = 3##, and:

##\sum_{n = k}^{\infty} F_{n} = \frac{k-1}{r-2}##

So:

##E(n) = \frac{(k-1)(r-1)}{r-2}##

E.g. ##k = 213, r = 5, E(n) = 282.67##
 
  • #40
micromass said:
4. Unstable particles are emitted from a source and decay at a distance ##x##, a real number that has an exponential distribution with characteristic length ##\lambda##. Decay events can be observed only if they occur in a window extending from ##x=1## to ##x=20##. We observe 6 decays at locations ##\{2,5,12,13,13,16\}##. What is ##\lambda##?

Seems like the approach to take here would be maximum likelihood estimation.

Do you want us to work this out by hand, or would it be sufficient to post code? (Or does knowing this much disqualify me from answering?)
 
  • #41
Ygggdrasil said:
Seems like the approach to take here would be maximum likelihood estimation.

Do you want us to work this out by hand, or would it be sufficient to post code? (Or does knowing this much disqualify me from answering?)

Posting code and the answer would be enough.
 
  • #42
micromass said:
Posting code and the answer would be enough.
Ok.

The strategy to answer #4 is to use maximum likelihood estimation.

In Matlab:
Code:
% Truncated exponential pdf: decays are only observable in the window [1, 20]
P = @(x,lambda) exppdf(x,lambda)./(expcdf(20,lambda)-expcdf(1,lambda));
data = [2 5 12 13 13 16];
% Maximum likelihood estimate, starting the search from the sample mean
est = mle(data,'pdf',P,'start',mean(data),'lower',0)
Gives an estimate of λ = 90.2
 
  • #43
For #9, can you confirm whether all 26 letters plus space are allowed in principle in the new language/in the random text generation? And, if random, do you mean 'pulled from a uniform distribution' or 'I just hit my keyboard'? If it is not random, can we assume the original text was in English? Is there a letter-by-letter correspondence, a word-by-word correspondence, or something more delocalized like a 'Fourier transform' or so?
 
  • #44
thephystudent said:
For #9, can you confirm whether all 26 letters plus space are allowed in principle in the new language/in the random text generation?

Yes.

And, if random, do you mean 'pulled from a uniform distribution' or 'I just hit my keyboard'? If it is not random, can we assume the original text was in English?

No, you cannot assume that the "random text" comes from a uniform distribution (which I think is very clearly not the case if you look at the data). And you cannot assume the original text was English.

Is there a letter-by-letter correspondence, a word-by-word correspondence, or something more delocalized like a 'Fourier transform' or so?

I won't give you this information.
 
  • Like
Likes thephystudent
  • #45
micromass said:
No, you cannot assume that the "random text" comes from a uniform distribution (which I think is very clearly not the case if you look at the data). And you cannot assume the original text was English.

The problem is that it seems to me that a text-generating system can always be designed specifically to mislead any test you want to run. And coding mechanisms can likewise be optimized to make the text look random.

I have a tendency to think the text is real for now, but I don't have sufficient arguments to show it.
 
  • #46
Problem 8

I'll try to tackle this problem in a mostly informal manner. What is of great importance, according to my analysis of the problem, is what is not given or implied. So, in the sentence "take turns at washing dishes", what are these turns? Are they regular or not? Do they favor the youngest girl or not? Then I'll go on with the sentence "There were four breakages". Over what period of time did those breakages take place? Going on, "Three of them were caused by the youngest girl". How young is she really?

Let's assume that turns at washing dishes are kept regular and do not favor any of the girls. Then the remaining two questions give some clue. If the breakages took place in a sufficiently small period of time, i.e. 16 days or some number with a relatively small upper bound (I cannot give a specific number here), then, provided "young" is not so young that she cannot properly hold everything that must be washed, it may well be the case that she is clumsy. If she is that young, we just cannot tell whether she is clumsy or not; in fact, clumsiness would not have its proper meaning. Otherwise, if the time period is sufficiently large, then no matter how young she is, it may well be the case that she is not clumsy at all.

Now, if turns are not regular, i.e. they favor (for whatever reason) one of the three other girls, then for a sufficiently small period of time: if the youngest is very young, we can't tell whether she is clumsy; if not, it may well be the case that she does the breakages on purpose or that she is clumsy (with odds favoring the former). For a sufficiently large period of time, no matter how young she is, she is probably not clumsy at all.

Finally, if turns at washing dishes favor the youngest girl, then for a sufficiently small period of time: if she is too young, we can't tell whether she is clumsy or not; if she is not, it turns out that she is probably clumsy. For a sufficiently large period of time, no matter how young she is, she is rather not clumsy.

I know I just talked in the abstract, but I cannot think of a way to quantify the odds properly for this problem. A rough sketch of a tree diagram could be constructed, but because the odds in the branching process cannot be defined in a precise numerical manner (e.g. what constitutes a sufficiently small/large period of time, or where the dividing line between too young and not lies, to name a few), I did this rough and abstract analysis. So, in conclusion, I cannot see a way of determining whether the youngest girl is clumsy or not given the data provided.

EDIT: in all the above I assume that the youngest girl has no health problem and that there is no external factor of any kind that could have an influence.
 
Last edited:
  • Like
Likes micromass
  • #47
For question 5:
PeroK said:
##E(n) = \frac{(k-1)(r-1)}{r-2}##

I've managed to prove the general formulas for this now. If anyone is interested:

First, for partial fractions:

##\frac{1}{n(n+1) \dots (n+r)} = \frac{1}{r!}[\frac{1}{n} - \binom{r}{1} \frac{1}{n+1} + \binom{r}{2} \frac{1}{n+2} \dots \pm \frac{1}{n+r}]##

And using this, you can show that:

##\sum_{n = k}^{\infty} \frac{1}{n(n+1) \dots (n+r)} = \frac{1}{r}[\frac{1}{k(k+1) \dots (k+r-1)}]##

With ##F_k = 1## and ##F_n = \frac{n-r}{n-1}F_{n-1} \ \ (n > k)## we have:

##\sum_{n = k}^{\infty} F_{n} = \frac{k-1}{r-2}##

##\sum_{n = k}^{\infty} \frac{F_{n}}{n} = \frac{1}{r-1}##

The general proof follows the same iterative pattern as for ##r = 3##:

##\sum_{n = k}^{\infty} F_{n} = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3})##
##+ (k-1)(k-2) \dots (k-r+1) \sum_{n = k}^{\infty} \frac{1}{n(n+1) \dots (n+r-2)}##

## = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3}) + (k-1)(k-2) \dots (k-r+1) (\frac{1}{r-2})(\frac{1}{k(k+1) \dots (k+r-3)})##

## = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3})[1 + \frac{k-1}{r-2}]##

## = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) + \dots + (\frac{k-r+1}{k})(\frac{k-r+2}{k+1}) \dots (\frac{k-2}{k+r -3})[\frac{k + r-3}{r-2}]##

The term ##(k + r -3)## cancels and the process iterates backwards until:

##\sum_{n = k}^{\infty} F_{n} = 1 + (\frac{k-r+1}{k}) + (\frac{k-r+1}{k})(\frac{k-r+2}{r-2}) = 1 + (\frac{k-r+1}{k})[1 + \frac{k-r+2}{r-2}] = 1 + (\frac{k-r+1}{k})[\frac{k}{r-2}] = 1 + (\frac{k-r+1}{r-2}) = \frac{k-1}{r-2}##

And similarly, you can show that ##\sum_{n = k}^{\infty} \frac{F_{n}}{n} = \frac{1}{r-1}##
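
These identities are also easy to check numerically. A minimal Matlab sketch (my own check, not part of the proof), truncating the infinite sums at a large ##N##:
Code:
% Verify sum(F_n) = (k-1)/(r-2) and sum(F_n/n) = 1/(r-1) numerically
k = 213; r = 5; N = 1e5;            % example values; N truncates the infinite sums
F = zeros(1, N); F(k) = 1;          % F_k = 1
for n = k+1:N
    F(n) = (n - r)/(n - 1)*F(n-1);  % recursion F_n = (n-r)/(n-1) F_{n-1}
end
sum(F)                              % ~ (k-1)/(r-2) = 70.667
sum(F ./ (1:N))                     % ~ 1/(r-1) = 0.25
Since ##F_n## falls off like ##n^{-(r-1)}##, the truncation error at ##N = 10^5## is negligible here.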
 
  • #48
micromass said:
That's an interesting analysis. I'll be waiting for more input before I spill the beans on this one.
Another try on 9).

Now that I've written a little program to play with strings, I've seen two things:
  1. The distribution of word lengths is pretty close to the expected numbers for the English language, at least below a word length of 10. Since English consists of many small words compared to other languages, this is strong evidence for an English text, although the total number of available words (220 given, 233 in a control text of equal length) isn't too big.
  2. The relative frequencies of some letters show an extraordinarily large difference from the expected values! This almost surely excludes the possibility of a Caesar cipher or any other bijection of the alphabet. (E.g. about 28 occurrences of "e", as the most frequent letter, would be expected, yet there are 134 occurrences of "u", 109 of "y", and 102 of "j".)
Conclusion: either it is a random text pretending to be English via its word lengths, or it is a code that encrypts words instead of letters.

In favor of the latter is the fact that some words, even long ones, occur repeatedly in a strange manner. And it would be easier to produce with the search-and-replace methods of today's editors (which would fail for substituting letters).
I'll have to take a closer look at that property.

Meanwhile, I have changed my mind compared to my earlier quick zipping analysis.

The text in 9) is English and not coded by a bijection of the alphabet, but by substitution of whole words or otherwise grouped letters.
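
For reference, the letter and word-length counting can be reproduced in a few lines of Matlab. This is just a sketch; the file name mystery_text.txt is a placeholder for wherever you have stored the problem's text:
Code:
txt = lower(fileread('mystery_text.txt'));      % placeholder file with the text of 9)
letters = txt(isletter(txt));
counts = histc(double(letters), double('a'):double('z'));
counts / numel(letters)                         % relative letter frequencies
words = strsplit(strtrim(txt));
histc(cellfun(@length, words), 1:20)            % word-length distribution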
 
  • Like
Likes thephystudent
  • #49
I would like to respond to #2. I did a simple statistical test. There are many tests that could be used, but I decided to look at the first head in a sequence of heads and check the number of times it is followed by another head. Assign a "1" to a head and a "0" to a tail. In the upper sequence, I counted 17 first heads, and 14 of them were followed by a head. For a binomial distribution, ## \mu=Np ## and ## \sigma^2=Npq ##. For N=17 and p=q=1/2, this gives a mean of 8.5 and ## \sigma=2.1 ##. 14 is nearly 3 ## \sigma ## from the mean: ## z=(14-8.5)/2.1 ##. For the second group, with the same test, I counted N=29 such occurrences and 13 subsequent heads. For this case, the mean is 14.5 and ## \sigma=2.7 ##. On this second row, the sample differed from the mean by much less than 1 ## \sigma ##. Clearly the top row is the one invented by a human and the bottom row is from a coin toss experiment. (Note: I would like to double-check my counting; it was a little hard to read, but I think my tallies of 17 and 14, and 29 and 13, were nearly correct.) I see @PeroK made a similar "qualitative" observation in post 11.
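
For anyone who wants to reproduce the count, here is a minimal Matlab sketch of the same test. Storing the flips as a 0/1 vector s is my own encoding, not part of the problem (and the sketch ignores a head at position 1 for simplicity):
Code:
% s: 0/1 row vector of flips (1 = head, 0 = tail); encoding is an assumption
idx = find(s(1:end-2) == 0 & s(2:end-1) == 1) + 1;  % first heads: a head preceded by a tail
N = numel(idx);                                     % number of such trials
k = sum(s(idx + 1) == 1);                           % first heads followed by another head
mu = N/2; sigma = sqrt(N)/2;                        % binomial mean and sd with p = q = 1/2
z = (k - mu)/sigma                                  % ~2.6 for the reported 14 of 17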
 
  • #50
Charles Link said:
I would like to respond to #2. I did a simple statistical test. There are many tests that could be used, but I decided to look at the first head in a sequence of heads and check the number of times it is followed by another head. Assign a "1" to a head and a "0" to a tail. In the upper sequence, I counted 17 first heads, and 14 of them were followed by a head. For a binomial distribution, ## \mu=Np ## and ## \sigma^2=Npq ##. For N=17 and p=q=1/2, this gives a mean of 8.5 and ## \sigma=2.1 ##. 14 is nearly 3 ## \sigma ## from the mean: ## z=(14-8.5)/2.1 ##. For the second group, with the same test, I counted N=29 such occurrences and 13 subsequent heads. For this case, the mean is 14.5 and ## \sigma=2.7 ##. On this second row, the sample differed from the mean by much less than 1 ## \sigma ##. Clearly the top row is the one invented by a human and the bottom row is from a coin toss experiment. (Note: I would like to double-check my counting; it was a little hard to read, but I think my tallies of 17 and 14, and 29 and 13, were nearly correct.) I see @PeroK made a similar "qualitative" observation in post 11.

Yes, in the first sequence there seems to be a correlation between the outcome of each toss and the previous one (it's more likely to be the same). For once I was more interested in the psychology than the maths. The normal tendency, I believe, is to do the opposite: try to even things up by favouring a Head after a Tail and vice versa. If sequence 1 is human generated, then whoever did it appears to have known this and overcompensated by favouring another Head after a Head and another Tail after a Tail. Or, perhaps, it was @micromass doing this deliberately!
 
  • Like
Likes Charles Link
  • #51
fresh_42 said:
Another try on 9)...
Correction: I made a mistake and used a wrong denominator in one line (#words instead of #letters). So the picture concerning the occurrences of letters is a different one. However, my next step would be attempts to decode the text, now that I have some clues for setting up a bijection of the alphabet. Since this is forbidden, my last hypothesis is
9) is an English text.
and now I'll try to decode it.
 
  • #52
PeroK said:
Yes, in the first sequence there seems to be a correlation between the outcome of each toss and the previous one (it's more likely to be the same). For once I was more interested in the psychology than the maths. The normal tendency, I believe, is to do the opposite: try to even things up by favouring a Head after a Tail and vice versa. If sequence 1 is human generated, then whoever did it appears to have known this and overcompensated by favouring another Head after a Head and another Tail after a Tail. Or, perhaps, it was @micromass doing this deliberately!
Hopefully he gives me credit for it already, but if not, a similar test could be done by observing the two coin flips after a first head appears. In that case, a simple observation is that there are often two heads following it. I haven't computed it numerically yet, but it looks as if that might give about 4 ## \sigma ## or 5 ## \sigma ## from the mean for the top sequence, and even 3 ## \sigma ## is very unlikely to happen by chance. ...editing... this second test of the two subsequent coin tosses (if I computed it correctly) gave just over ## 2 \sigma ## from the mean for the first set (and not ## 4 \sigma ## or ## 5 \sigma ##), but it is still an indicator that the sequence was likely man-made.
 
Last edited:
  • #53
Charles Link said:
Hopefully he gives me credit for it already, but if not, a similar test could be done by observing the two coin flips after a first head appears. In that case, a simple observation is that there are often two heads following it. I haven't computed it numerically yet, but it looks as if that might give about 4 ## \sigma ## or 5 ## \sigma ## from the mean for the top sequence, and even 3 ## \sigma ## is very unlikely to happen by chance.

This is also the result I seemed to get when observing the Shannon entropy, which is a (fancy but strongly physics-related) measure of randomness and is thus expected to be higher for the real sequence. I seem to have made a mistake last time, because now I also observe the THT sequence to have higher entropy for virtually any combination.

I don't know how to make a table here on Physics Forums, but the entropies for the lists {(n,n+1)} of two subsequent flips are 1.365/1.382; for the lists {(n,n+2)}, 1.376/1.383; for the lists {(n,n+30)} I get 1.368/1.382; and using {(n,n+100)} gives 1.364/1.383.

And for the lists {(n,n+1,n+2)} I get 2.032/2.072.

In conclusion, the THT sequence is real and the THH sequence is fake. A disadvantage of my approach is that, unlike the approaches of other people using the more canonical statistical methods, I cannot give an uncertainty/probability that the random sequence accidentally seemed more man-made.
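
For the record, here is roughly how such an entropy estimate can be computed in Matlab. A sketch under my own assumptions: s is a 0/1 vector of the flips, and the entropy is in nats, which matches the magnitudes quoted above (the maximum for pairs is log 4 ≈ 1.386):
Code:
% s: 0/1 vector of flips; d: lag between the paired flips
d = 1;
pairs = 2*s(1:end-d) + s(1+d:end);      % encode each pair (s(n), s(n+d)) as 0..3
p = histc(pairs, 0:3)/numel(pairs);     % empirical pair probabilities
p = p(p > 0);                           % drop empty bins so the log is finite
H = -sum(p .* log(p))                   % Shannon entropy in nats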
 
  • Like
Likes Charles Link
  • #54
I'll let you guys continue to think of this a bit more. I'll give the answer in a few days time.
 
  • #55
@micromass Thank you. I now have a nice little mini Enigma for Caesar ciphers, random letter encryption, word and letter counting, decoding attempts ... I guess I'll have to add an RSA feature to finish it.
 
  • #56
fresh_42 said:
@micromass Thank you. I now have a nice little mini Enigma for Caesar ciphers, random letter encryption, word and letter counting, decoding attempts ... I guess I'll have to add an RSA feature to finish it.

I might make a cryptography challenge later :smile:
 
  • #57
micromass said:
I might make a cryptography challenge later :smile:
I don't know whether this is good or bad news for my little playground. As far as I know you, the chances are high that you will put some noise on the channel, and I'm not sure whether I want to deal with error-correcting mechanisms. :nb)
 
  • #58
fresh_42 said:
I'm not sure whether I want to deal with error-correcting mechanisms. :nb)

Of course you do!
 
  • #59
micromass said:
I'll let you guys continue to think of this a bit more. I'll give the answer in a few days time.
I have a more definitive statistical test for #2. The item that separates the first sequence from the second is the number of changes of state that take place (relative to the previous flip). In 100 coin tosses, there are ## N=99 ## possible changes of state. It's a simple binomial test with ## \mu=Np ## and ## \sigma^2=Npq ## (with ## p=q=1/2 ##). This gives ## \mu=49.5 ## and ## \sigma=5.0 ##. For the first sequence I counted k=33 changes of state. This gives a ## z=(49.5-33)/5.0 ## that is slightly greater than 3 (the result was more than ## 3 \sigma ## from the mean). For the second sequence I still need to tally it, but I want to post this before someone else beats me to it. ...editing... for the second sequence I counted k=50 (approximately) transitions. (My eyes aren't real steady, so the count may be off by one or two.) This second case is well within ## 1 \sigma ## of the mean and is the kind of result to be expected from a random coin toss.
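
In code this transition count is a one-liner. A minimal Matlab sketch, again assuming the flips are stored as a 0/1 vector s (my own encoding):
Code:
k = sum(diff(s) ~= 0);             % changes of state between consecutive flips
N = numel(s) - 1;                  % 99 possible transitions in 100 flips
mu = N/2; sigma = sqrt(N)/2;       % binomial mean and sd with p = q = 1/2
z = (mu - k)/sigma                 % ~3.3 for the reported k = 33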
 
Last edited:
  • #60
Problem 1: I don't get it. In what sense "optimal"?
 
  • #61
Erland said:
Problem 1: I don't get it. In what sense "optimal"?

That's for you to decide on. I left the question very open for this purpose. All I'm giving is the number of people on the train on several days; the question is how you would decide how many seats the train should have. Clearly we don't want it to be too long, since that's a waste of money. But we don't want it to be too short either; otherwise we get complaints about people not finding seats.
 
  • #62
micromass said:
I left the question very open for this purpose.
Darn it micromass...
 
  • #63
micromass said:
That's for you to decide on. I left the question very open for this purpose. All I'm giving is the number of people on the train on several days; the question is how you would decide how many seats the train should have. Clearly we don't want it to be too long, since that's a waste of money. But we don't want it to be too short either; otherwise we get complaints about people not finding seats.
Ok, then I say that the optimal number of seats is 254. Obviously, more than 254 seats would just be a waste of money. On the other hand, if you have fewer than 254 seats, then the orcs who miss the train on Day 3 because of the lack of seats would get very angry and kill you, and your life is worth more than the cost of a few extra seats.
 
  • #64
ProfuselyQuarky said:
Darn it micromass...
Guess there is a completely different solution for Japanese subways ...
 
  • Like
Likes ProfuselyQuarky
  • #65
Trains:
Assuming the number of passengers follows a Gaussian distribution, we can estimate its mean and standard deviation as 226 ± 20. If we plan for 266 seats, we have people standing with a probability of about 2%, and even then just a small number has to stand. We probably cannot choose exactly 266 seats, and it is not necessary either; something around that value should be fine.

Real passenger numbers do not follow Gaussian distributions, however; they have much longer tails. A good railroad company would check whether a larger capacity might be needed (e.g. some large celebrations at Mordor or Rohan), and take a longer train (or even a second train) in that case. Assuming Middle-earth follows a weekly cycle like Earth, we might also consider Fridays and the weekends separately.

Coin tosses: I'll check the numbers tomorrow, but just by looking at the sequences, the second one is completely missing longer runs of the same side; it was generated by a human.
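
A quick Matlab sketch of the Gaussian estimate above. The vector riders stands for the daily passenger counts given in the problem, which aren't reproduced here:
Code:
% riders: vector of the daily passenger counts from the problem statement
mu = mean(riders); sd = std(riders);   % roughly 226 and 20 per the estimate above
seats = ceil(mu + 2*sd)                % two sigma above the mean, ~266 seats
1 - normcdf(seats, mu, sd)             % chance that demand exceeds capacity, ~2%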
 
  • #66
Another way of doing #2

If the sequence was generated by a coin toss, then the number of heads is ##Binomial(199,0.5)##. Since each flip is independent, the probability that flip ##n+1## is identical to flip ##n## is ##0.5##, so the number of repetitions is ##Binomial(198,0.5)##.

Let ##k## be the number of repetitions. We want to assess whether ##p = 0.5##.
The first sequence contains 117 repetitions, which gives (under a uniform prior) a posterior distribution of ##Beta(1 + 117, 198 - 117 + 1)##, under which ##p > 0.5## with probability ##>0.99##.

The second contains 96 repetitions, which gives (under a uniform prior) a posterior distribution of ##Beta(1 + 96, 198 - 96 + 1)##, which is compatible with ##p = 0.5##.
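
These posterior tail probabilities are easy to compute with the Statistics Toolbox betacdf. A sketch, assuming the flips are stored as a 0/1 vector s (my own encoding):
Code:
k = sum(diff(s) == 0);                 % repetitions: flip n+1 equal to flip n
n = numel(s) - 1;                      % 198 opportunities in 199 flips
1 - betacdf(0.5, 1 + k, n - k + 1)     % posterior P(p > 0.5) under a uniform prior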
 
  • #67
For number 6:

This really depends on what you mean by "can tell the difference". If by "guessing" we mean flipping a fair coin, then the number of successes by guessing is ##Binomial(10,p=0.5)##. Under a flat prior, 8 successes gives a posterior of ##Beta(9,3)##, which suggests that ##p > 0.5## with probability about ##0.97##, which is probably enough for me to accept a low-risk claim like "can distinguish between Coke and Pepsi". You could do some kind of formal model comparison, but I hate that kind of stuff.
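
For reference, that posterior probability is a one-liner in Matlab (a sketch using the Statistics Toolbox):
Code:
1 - betacdf(0.5, 9, 3)   % P(p > 0.5 | 8 of 10 correct, flat prior), about 0.97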

For number 1

0 seats. Nobody in Rohan is going to Mordor consensually, and Rohan doesn't want Orc refugees.
 
  • #68
Number Nine said:
The first sequence contains 117 repetitions, which gives (under a uniform prior) a posterior distribution of ##Beta(1 + 117, 198 - 117 + 1)##, under which ##p > 0.5## with probability ##>0.99##.
One concern that is difficult to accurately quantify is the fact that we are all looking for statistics that will reflect some unlikely condition manifested by one of the two strings of coin flips. The more tests we run looking for unlikely coincidences, the more likely it is that we will find one. If, for instance, one runs 50 [independent] statistical tests on each of two random data sets, then it is about 63% likely that at least one of them will claim "non-random" at the 99% confidence level (since ##1 - 0.99^{100} \approx 0.63##).
 
  • #69
7.

We have 7 days, thus 7 bins, and 12 tickets. Without regard to distribution, we can say that there are C(7-1+12, 12) ways to get 12 tickets. Furthermore, we have C(2-1+12, 12) ways to get all 12 tickets on just Tuesday and Thursday (treating the tickets as indistinguishable). Thus the professor has a 13/18564 probability of getting tickets only on Tuesday and Thursday. As for whether it's worth getting a garage on those days, that depends on the professor's valuation of the time lost (or gained) by getting a garage versus the cost of a ticket.

For example, if the garage is a 30-minute walk through a rough neighborhood and the parking ticket is 10 bucks, then the person may choose the ticket. If the garage is 1 minute further away and costs 10 dollars, but the ticket is 100, then the garage is probably a good idea.
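
The counts above are quick to verify in Matlab (a minimal sketch):
Code:
total   = nchoosek(7 - 1 + 12, 12)   % C(18,12) = 18564 ways over 7 days
tue_thu = nchoosek(2 - 1 + 12, 12)   % C(13,12) = 13 ways on just two days
p = tue_thu/total                    % 13/18564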
 
  • #70
jbriggs444 said:
One concern that is difficult to accurately quantify is the fact that we are all looking for statistics which will reflect some unlikely condition that is manifested by one of the two strings of coin flips. The more tests we run looking for unlikely coincidences, the more likely it is that we will find an unlikely coincidence. If, for instance, one runs 50 [independent] statistical tests on each of two random data sets then it is about 63% likely that one of those will claim "non-random" at the 99% confidence level.

Yes, there are any number of statistics one could look at to quantify "non-randomness", so this will probably be a problem no matter which approach we use. The number of repetitions is just the simplest, and probably one of the most plausible places in which to find a difference, since most people tend to misjudge the expected amount of repetition when constructing "random" sequences.

The best approach might be to test a variety of features and look for consensus.
 