# Is Marilyn Vos Savant wrong on this probability question?

1. Aug 10, 2011

### CantorSet

Hi everyone,

Someone posed the question:

and Marilyn (high IQ record holder) answers:

What do you guys think? You can find the original in the link below.

2. Aug 10, 2011

### chiro

She is right in saying that both can happen. Any permutation of the result is possible (and hence probable).

One intuitive idea that you can use to analyze this situation is entropy. In a system that is purely random (think uniform probability of any possibility occurring), it's more likely that you get a result which reflects this high entropy than something with a lower entropy.

In the case of the first result with all 1's: this is something which has extremely low entropy. The fact that we have only one choice across all trials suggests that if this result is representative of the underlying process, then the process itself has a low measure of randomness (again represented by entropy).

The fact that we are assuming that the process of dice rolling is purely random suggests that we would intuitively reject the result with near zero entropy and instead favor the result with entropy closer to the theoretical entropy of the process, which is the second result.

Like she said, it doesn't mean that the sequence of 1's isn't possible. It just means that, for a system with a high amount of entropy, you would expect a result whose entropy is close to the theoretical entropy of the system, provided the assumptions used to calculate the theoretical entropy are good enough.

We use this idea in physics to explain things like why plates don't just un-break themselves spontaneously in the real world. When we observe physical systems we find that things do get more "random" and as such we have the expectation of seeing things that exhibit this property.

So if you want to think about how you would judge whether you would believe one outcome over another in respect to a given system, think about the results in terms of the entropy of the given system as a good heuristic to gauge some level of likelihood.

Last edited: Aug 11, 2011
3. Aug 10, 2011

### CantorSet

Thanks for the response, Chiro.

I'm unclear what entropy means in this context but it sounds interesting. This problem is unsettling because if we let A be the event with the sequence of all 1's, B be the event of the sequence with mixed numbers and let C = A U B. Then, we have P(A) = P(B) and therefore,

$P(A|C) = \frac{P(A)}{P(A)+P(B)} = \frac{P(B)}{P(A)+P(B)} = P(B|C)$

So it seems they should have the same conditional probability. But maybe I'm not setting up these equations correctly. Also, how would we factor in the entropy you're speaking of into these equations?
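As a quick sanity check of the setup above, here is a minimal sketch using exact fractions. It assumes 20 independent rolls of a fair die; the variable names are only illustrative.

```python
from fractions import Fraction

# Assumption: 20 independent rolls of a fair six-sided die.
p_a = Fraction(1, 6) ** 20  # A: the all-1's sequence
p_b = Fraction(1, 6) ** 20  # B: the one specific mixed sequence

# A and B are disjoint, so P(C) = P(A U B) = P(A) + P(B).
p_c = p_a + p_b
p_a_given_c = p_a / p_c
p_b_given_c = p_b / p_c

print(p_a_given_c, p_b_given_c)
```

Under those assumptions both conditional probabilities come out to exactly one half.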

4. Aug 10, 2011

### chiro

Entropy is a way of measuring information content. If something is completely random it has a high entropy because all things are equally likely and therefore if you have many things that are equally likely you need to allocate more information to something. For example if there are eight possibilities in a random variable and they are equally likely the entropy in base 2 is 3. But if for example you have another random variable (again with eight possibilities) and probability of one particular event is 0.93 with the others having probability of 0.01, then the entropy will be a lot lower because you will be able to get away with using less information.
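The two eight-outcome examples above can be checked numerically. This is a small sketch, not a reference implementation; the helper name `shannon_entropy` is made up for illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # Outcomes with probability 0 contribute nothing, by the usual convention.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = [1 / 8] * 8          # eight equally likely outcomes
skewed = [0.93] + [0.01] * 7   # one dominant outcome, seven rare ones

h_uniform = shannon_entropy(uniform)  # 3 bits
h_skewed = shannon_entropy(skewed)    # roughly 0.56 bits

print(h_uniform, h_skewed)
```

The uniform case gives 3 bits, while the skewed case gives a little over half a bit.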

5. Aug 11, 2011

### Rasalhague

I'm curious too. Could you explain what entropy does mean in this context, using this example. Wikipedia defines information entropy as a property of a random variable. Which random variable are you talking about the entropy of? What is its domain, what is its range, what is the rule that specifies its value for each input?

One obvious random variable we have here is the identity function on the set {n | n is an integer, 0 < n < 7}^20, whose distribution gives the (equal) probability of the two events {(6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)} and {(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)}. But presumably it's not this random variable whose entropy you mean.

On the other hand, Wolfram Alpha and Mathematica treat entropy as a property of a list. I'm not sure what their definition is. Are these two kinds of entropy related?

6. Aug 11, 2011

### chiro

Entropy is a concept invariant to any random variable, but for the sake of the examples listed by the OP, the R.V. is simply a uniform random variable (discrete in the OP's case).

Basically higher entropy means more uncertainty which means more "randomness". In terms of how to calculate that, that is well documented in the literature. For applications in physics look at thermodynamics. The idea that things get more random as time passes and the fact that entropy increases is no accident. Entropy allows us to quantify this kind of behavior.

With regards to the dice, the outcome of all 1's in terms of its entropy value is very close to zero. There is only one symbol and no real randomness in that information.

On the other hand in the second sequence of numbers there is variation in symbol type which means higher entropy.

Of these two, the second sequence has an entropy a lot closer to the theoretical entropy than the first one, so if we were to judge whether the second sequence is "more likely" to have come from a particular process than the first, entropy is a quantitative variable that can be used to make an inference.

But you also have to remember that even though there is a big difference in entropy, both results under that process are likely. Entropy does not dictate otherwise, but it does give a quantitative measure of the randomness of some process and hence gives us an indicator of how "likely" some piece of information is to come from a particular process.

If you want a definitive definition of entropy (the one I have been talking about), get Claude Shannon's papers or the Information Theory book by Thomas Cover (latest edition as far as I know is the 2nd edition). Broadly, the definition is the sum (or integral) over the domain of $-p \log p$.
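For concreteness, here is a rough sketch of that formula applied to the observed symbol frequencies of the two sequences from the thread. The function name `empirical_entropy` is an illustrative choice, and treating the frequency counts as a distribution is an assumption of this sketch.

```python
import math
from collections import Counter

def empirical_entropy(seq):
    """Entropy in bits of the observed symbol frequencies in seq."""
    n = len(seq)
    # Each symbol with count c has observed frequency c / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

all_ones = "1" * 20
mixed = "66234441536125563152"

h_ones = empirical_entropy(all_ones)  # 0 bits: only one symbol appears
h_mixed = empirical_entropy(mixed)    # about 2.57 bits
h_fair = math.log2(6)                 # about 2.585 bits for a fair die

print(h_ones, h_mixed, h_fair)
```

Note that this number depends only on how often each face appears, not on the order in which the faces were rolled.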

7. Aug 11, 2011

### Hurkyl

Staff Emeritus
Assuming your quote is fair (I can't see the article), Marilyn made an unfortunately common psychological mistake. The question she answered is very different from the question that was asked.
This statement is quite true. But can you see why this has no bearing on the following question?
(I think chiro is making the same mistake -- answering the question of "all 1's versus a mix of all numbers" instead of answering the question "all 1's vs that other specific sequence of numbers")

8. Aug 11, 2011

### chiro

Can you tell me what the mistake is?

The entropy argument also fits in with your second scenario and not just the first. You base your entropy calculations on the number and the frequency of symbols that occur in your information sample. Both of these are incorporated into an entropy calculation and not just one or the other.

9. Aug 11, 2011

### Hurkyl

Staff Emeritus
Not until you tell me why you think entropy has any relevance at all.

I can guess what your line of reasoning is, and it boils down to you answering the question
Which is more likely?
• Getting 20 1's in any order
• Getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's in any order

10. Aug 11, 2011

### Hurkyl

Staff Emeritus
Another way to describe the mistake that might be useful:

The greater likelihood of getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's versus 20 1's is exactly canceled out by the unlikelihood that they happen to be arranged in exactly the order 66234441536125563152.
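The cancellation can be verified with exact arithmetic. The following is a minimal sketch assuming 20 independent rolls of a fair die; the variable names are only for illustration.

```python
from fractions import Fraction
from math import factorial

counts = [3, 3, 3, 3, 4, 4]  # how many 1's, 2's, ..., 6's
n = sum(counts)              # 20 rolls

# Number of distinct orderings with these counts (multinomial coefficient).
arrangements = factorial(n)
for c in counts:
    arrangements //= factorial(c)

p_sequence = Fraction(1, 6) ** n                      # any one specific sequence
p_histogram = arrangements * p_sequence               # these counts, in any order
p_order_given_histogram = Fraction(1, arrangements)   # one particular ordering

# The two factors cancel, leaving the probability of a single sequence.
print(arrangements, p_histogram * p_order_given_histogram == p_sequence)
```

The histogram (3,3,3,3,4,4) is trillions of times more likely than the histogram (20,0,0,0,0,0), but picking out one ordering from that histogram divides the probability by the same factor.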

11. Aug 11, 2011

### chiro

Both completely depend on the distribution and its properties. If you have some distribution with independence properties like a binomial distribution then the calculation becomes a lot easier, but if you have something where future trial probability spaces are dependent on prior trials (or even the entire history of trials), then it becomes complicated.

Entropy as stated before gives us a quantitative measure of information content and uncertainty. The most uncertain source of information comes from a random variable where the distribution is uniform, and where new trials in the stochastic process are statistically independent from past trials, in the sense that $P(X_n = x \mid X_{n-1} = a_{n-1}, X_{n-2} = a_{n-2}, \ldots, X_0 = a_0) = P(X_n = x)$, which is the strongest form of independence there is.

Now entropy requires us to supply a distribution and based on that we get our entropy in some base. In the uniform random variable (discrete), the entropy is always log2(n) where n is the number of possible states.

Now if we have a process with a large number of trials, strong law arguments should allow us to prove that the entropy of the information source tends to the theoretical entropy of our assumed distribution. This sort of reasoning is used in the classical frequentist approach especially with sampling distributions but we typically derive properties for things like the distribution of a statistic and not something like entropy.

Now I don't have a mathematical formula for the confidence interval for the inference for the entropy of a sample with respect to the entropy of our underlying distribution (but I'm thinking about doing the calculation myself a little later on), but based on the argument that if the number of trials (in this case dice rolls) is large enough, then the entropy of that sample (with respect to the distribution) should approach the theoretical entropy of the assumed distribution.

Also with regard to order, I haven't said that order affects anything. That depends entirely on the distribution. You could write an extremely complex distribution up to trial n where the distribution is a really complicated mathematical expression that is conditional on every prior trial, or you could use the independence assumption (like the binomial, multinomial).

I am simply basing my argument around the idea of creating a distribution with confidence intervals that are based on the assumption that as more information is created, the entropy of the entire information approaches the theoretical value, more or less analogously to the way that the variance of the mean statistic of a distribution approaches zero as n goes to infinity.

I haven't provided any proofs of anything and I am aware of that, but if I am wrong it would be nice to hear some justification by either a proof or simply a hole in my reasoning.

12. Aug 11, 2011

### chiro

It depends on your assumptions. If your stochastic process has complex dependencies on prior information, then of course things that are the same in the uniform case are not the same in your complex distribution.

If you're using your standard purely random assumptions (in this case discrete uniform, any trial isn't affected by any other), then ordering won't have an effect on the entropy of your theoretical distribution.

It's not that the scenario with all 1's is impossible, it's just that it becomes more unlikely as that keeps happening. Picture a confidence interval that is measuring the sample entropy against the theoretical entropy of the assumed distribution. As the information content gets more trials (and more symbols), it is expected that the entropy tends towards the theoretical value, and the shape of the sampling distribution in the context of entropy also shrinks (i.e. variance decreases).
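A rough simulation can illustrate the convergence described here. This is only a sketch: the seed, the sample sizes, and the helper name `empirical_entropy` are arbitrary choices, and the rolls are assumed independent and fair.

```python
import math
import random
from collections import Counter

def empirical_entropy(rolls):
    """Entropy in bits of the observed face frequencies."""
    n = len(rolls)
    return sum((c / n) * math.log2(n / c) for c in Counter(rolls).values())

rng = random.Random(0)   # fixed seed, chosen arbitrarily
target = math.log2(6)    # theoretical entropy of one fair roll

# Gap between the theoretical entropy and the sample entropy at each size.
gaps = {}
for n in (20, 200, 2000, 20000, 200000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    gaps[n] = target - empirical_entropy(rolls)

for n, gap in gaps.items():
    print(n, gap)
```

The gap shrinks toward zero as the number of rolls grows. This is a statement about observed frequencies over many trials; it does not by itself say anything about the probability of one specific 20-roll sequence.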

13. Aug 11, 2011

### Hurkyl

Staff Emeritus
We're given the distribution: 20 successive rolls of a fair die. The probability is thus uniform over all 20-long sequences of the digits 1,2,3,4,5,6.

In particular, P(11111111111111111111) = P(66234441536125563152).

I'm having trouble making sense of your post. At best, it seems to be a very long winded way of saying "Given a lot of independent trials, the observed frequencies tend towards the probability distribution the samples are drawn from".

But, as I've said before, the question is not:
Which histogram is more likely? (20,0,0,0,0,0) or (3,3,3,3,4,4)?

Last edited: Aug 11, 2011
14. Aug 11, 2011

### TylerH

This reminds me of how, in biology, my teacher emphasized, assuming equal probability of boy or girl, that even if you have 10 boys, there is equal probability between the possible genders of the 11th.

What makes both the biology example and the die example true is that, if the preceding events don't affect the probability of subsequent events, then the preceding events are just as probable to happen again as they were. Concretely, the fact that what you roll has no effect on subsequent rolls means that rolling a 1 doesn't make rolling a 1 on your next roll any more or less probable.
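The biology example can be checked by brute-force enumeration. This is a small sketch assuming each birth is independently a boy or a girl with probability 1/2; the variable names are just for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumption: each birth is independently "B" or "G" with probability 1/2,
# so all 2**11 sequences of 11 children are equally likely.
families = list(product("BG", repeat=11))

# Condition on the first ten children all being boys.
first_ten_boys = [f for f in families if all(child == "B" for child in f[:10])]
eleventh_is_boy = [f for f in first_ten_boys if f[10] == "B"]

p_boy_given_ten_boys = Fraction(len(eleventh_is_boy), len(first_ten_boys))
print(p_boy_given_ten_boys)
```

Only two families survive the conditioning, one ending in a boy and one in a girl, so the conditional probability is one half.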

15. Aug 11, 2011

### chiro

The entropy of having all 1's is zero. The entropy of your right hand side example is a lot higher (I'm just using the standard definition of entropy).

The reason why the first is zero is because P(X = 1) = 1 and log(1) = 0.

Maybe I'm arguing something that I didn't intend to argue, so I'll be clear.

What I'm arguing is that as you get more trials from your process, the entropy of your information sample approaches that of the underlying distribution. Probabilistically, getting 111111111 and 635162122 are equally likely and I'm not disagreeing with that. What I'm saying is that I have more "confidence" with the second one coming from our stochastic process than the first.

If I made an error in communication, then I hope it is cleared up now. The probability is the same, but the level of confidence is not, and that is why Vos Savant said that she preferred the second output vs the first.

16. Aug 11, 2011

### Hurkyl

Staff Emeritus
What is "confidence"? Is it anything other than "I know the math says one thing, but I don't want to believe it"? (edit: I don't mean to be condescending, but it is really easy to try and rationalize one's intuition when faced with the fact it's simply wrong)

The mistake I mentioned earlier -- here is one way to make that mistake:
I'm going to invent a statistical test: my statistic T is the entropy of the distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623.... Therefore, I infer that 6623... is what was rolled

Last edited: Aug 11, 2011
17. Aug 11, 2011

The standard definition uses only probability distribution, not samples/outcomes.
Indeed, the entropy of the empirical distribution tends to the entropy of the underlying distribution, as N (number of throws) tends to infinity. But in our case N=20 is fixed.

What do you mean by that?
Having a concrete sample - a sequence of 20 throws - does indeed allow you to change your confidence about some assumption, e.g. the hypothesis that the dice is fair, but this is a whole different problem.
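To make the "whole different problem" concrete, here is a hedged sketch comparing the two sequences under a fair die and under a loaded die. The loaded die is entirely hypothetical (1 comes up half the time, every other face one time in ten) and is invented here purely for illustration.

```python
from fractions import Fraction

def likelihood(seq, die):
    """Probability of seeing exactly seq, given independent rolls of die."""
    p = Fraction(1)
    for face in seq:
        p *= die[face]
    return p

fair = {f: Fraction(1, 6) for f in "123456"}
# Hypothetical loaded die, invented for illustration only.
loaded = {f: Fraction(1, 10) for f in "123456"}
loaded["1"] = Fraction(1, 2)

all_ones = "1" * 20
mixed = "66234441536125563152"

# Given a fair die, the two specific sequences are equally probable.
same_under_fair = likelihood(all_ones, fair) == likelihood(mixed, fair)

# Likelihood ratios (loaded vs fair) for each observed sequence.
ratio_ones = likelihood(all_ones, loaded) / likelihood(all_ones, fair)
ratio_mixed = likelihood(mixed, loaded) / likelihood(mixed, fair)

print(same_under_fair, float(ratio_ones), float(ratio_mixed))
```

Seeing all 1's strongly favors the loaded-die hypothesis and the mixed sequence favors the fair die, yet once the die is assumed fair the two sequences have the same probability.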

18. Aug 11, 2011

### chiro

By confidence I mean that under a given assumption (in this case the number of trials, the theoretical assumed distribution) and based on the sample (our actual generated information), there is a distribution that formalizes our idea of inference for something to be x% probable.

As a concrete example, let's focus on a statistical parameter: the mean of a distribution. Now we have sample information which is obviously finite. We use the unbiased estimator to determine a good estimate of our mean.

Now we assume that the underlying distribution of each of the samples (remember they are from the same distribution) is some distribution, let's say normal. So we use statistical theory to calculate the distribution of the normalized difference of the statistic and the parameter (so [x_bar - mu]/root(Var)) and this is normal because the statistic is a linear combination of normally distributed variables.

So based on this we construct a confidence interval given our assumptions and we find that we have a 95% interval that says that 95% of the time, the true difference between the statistic and the parameter would lie in this interval.

As we increase our sample size we expect the variance of this distribution to go to zero as more samples are added and this is intuitive since our estimate for our true mean (the parameter we are estimating) gets more accurate. This is basically a result of the strong law of large numbers and this kind of thinking is used to generate much of classical statistical theory.
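The shrinking interval for the mean can be sketched as follows. This assumes normally distributed samples with a known standard deviation; the value of `sigma`, the sample sizes, and the helper name `ci_half_width` are illustrative choices.

```python
import math

def ci_half_width(sigma, n, z=1.96):
    """Half-width of an approximate 95% interval for the mean, with known sigma."""
    # The standard error of the sample mean is sigma / sqrt(n).
    return z * sigma / math.sqrt(n)

sigma = 1.0  # assumed known standard deviation, for illustration
widths = {n: ci_half_width(sigma, n) for n in (25, 100, 400, 1600)}

for n, w in widths.items():
    print(n, w)
```

Quadrupling the sample size halves the interval, which is the tightening behavior the mean example relies on.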

With the above in mind, I am proposing the same thing but instead of the mean parameter, I am using the parameter of entropy.

Like with the mean example, you create a distribution that lets you do an inference calculation based on the entropy statistic belonging to a distribution with a theoretical entropy (the parameter in this case). So using the analogue with the mean you get a sampling distribution that, instead of being focused on measuring the mean, is focused on testing the inference that the sample has an entropy (the statistic) corresponding to the theoretical entropy (the parameter).

You construct your sampling distribution, and based on that generate a confidence interval for some level of confidence. You base this on (1) the theoretical entropy (from your assumed distribution), (2) your statistic entropy (measured from your sample), and (3) the number of samples. You may actually need more than that depending on the distribution itself, but let's stick to the basics.

Again like with your mean, you construct an interval based on your level of confidence, and then depending on the result accept or reject the null hypothesis.

Now with the uniform distribution (discrete) the entropy is always maximized, so as n increases you would expect the distribution to tighten around the maximum. It is done much in the same way that the distribution of the mean tightens around a specific value.

So in finishing, confidence is basically using probability models to make inference on how "confident" you are on a sample meeting the criteria of some theoretical parameter, which in this case is entropy.

19. Aug 11, 2011

### chiro

See my above post.

20. Aug 11, 2011