Is Marilyn Vos Savant wrong on this probability question?

Hi everyone,

Someone posed the question:

 Say you plan to roll a die 20 times. Which of these results is more likely: (a) 11111111111111111111, or (b) 66234441536125563152?
and Marilyn (high IQ record holder) answers:

 In theory, the results are equally likely. Both specify the number that must appear each time the die is rolled. (For example, the 10th number in the first series must be a 1. The 10th number in the second series must be a 3.) Each number—1 through 6—has the same chance of landing faceup. But let’s say you tossed a die out of my view and then said that the results were one of the above. Which series is more likely to be the one you threw? Because the roll has already occurred, the answer is (b). It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.
What do you guys think? You can find the original in the link below.


 Quote by CantorSet Hi everyone, Someone posed the question: and Marilyn (high IQ record holder) answers: What do you guys think? You can find the original in the link below. http://www.parade.com/askmarilyn/201...-07-31-11.html
She is right in saying that both can happen. Any sequence of results is possible (and hence has nonzero probability).

One intuitive idea that you can use to analyze this situation is entropy. In a system that is purely random (think uniform probability of any possibility occurring), it's more likely that you get a result which reflects this high entropy than one with a lower entropy.

In the case of the first result with all 1's: this is something which has extremely low entropy. The fact that we have only one symbol across all trials suggests that if this result is representative of the underlying process, then the process itself has a low measure of randomness (again represented by entropy).

The fact that we are assuming that the process of dice rolling is purely random suggests that we would intuitively reject the result with near zero entropy and instead favor the result with entropy closer to the theoretical entropy of the process, which is the second result.

Like she said, it doesn't mean that the sequence of 1's isn't possible. It just means that, in the context of entropy, for a system with a high amount of entropy you would expect a result whose entropy is close to the theoretical entropy of the system, provided the assumptions used to calculate the theoretical entropy are good enough.

We use this idea in physics to explain things like why plates don't just un-break themselves spontaneously in the real world. When we observe physical systems we find that things do get more "random" and as such we have the expectation of seeing things that exhibit this property.

So if you want to think about how you would judge whether you would believe one outcome over another in respect to a given system, think about the results in terms of the entropy of the given system as a good heuristic to gauge some level of likelihood.

 Quote by chiro She is right in saying that both can happen. [...] So if you want to think about how you would judge whether you would believe one outcome over another in respect to a given system, think about the results in terms of the entropy of the given system as a good heuristic to gauge some level of likelihood.
Thanks for the response, Chiro.

I'm unclear what entropy means in this context but it sounds interesting. This problem is unsettling. Let A be the event of the sequence of all 1's, B the event of the sequence with mixed numbers, and C = A ∪ B. Then P(A) = P(B), and since A and B are disjoint,

$P(A|C) = \frac{P(A)}{P(A)+P(B)} = \frac{P(B)}{P(A)+P(B)} = P(B|C)$

So it seems they should have the same conditional probability. But maybe I'm not setting up these equations correctly. Also, how would we factor in the entropy you're speaking of into these equations?
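
As a quick check of the setup above, here is a minimal sketch using exact fractions. It assumes a fair six-sided die and independent rolls; the variable names are purely illustrative.

```python
from fractions import Fraction

ROLLS = 20
SIDES = 6

# Probability of any one fully specified sequence of 20 fair, independent rolls.
p_sequence = Fraction(1, SIDES) ** ROLLS

p_a = p_sequence  # A: 11111111111111111111
p_b = p_sequence  # B: 66234441536125563152

# A and B are disjoint, so for C = A ∪ B we have P(C) = P(A) + P(B).
p_c = p_a + p_b

# A is a subset of C, so P(A and C) = P(A); likewise for B.
p_a_given_c = p_a / p_c
p_b_given_c = p_b / p_c

print(p_sequence)   # 1/3656158440062976
print(p_a_given_c)  # 1/2
print(p_b_given_c)  # 1/2
```

Under these assumptions the two conditional probabilities come out equal, which matches the equation above.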


 Quote by CantorSet I'm unclear what entropy means in this context but it sounds interesting. [...] Also, how would we factor in the entropy you're speaking of into these equations?
Entropy is a way of measuring information content. If something is completely random, it has high entropy: all outcomes are equally likely, so on average you need more information to specify which one occurred. For example, if a random variable has eight equally likely possibilities, its entropy in base 2 is 3 bits. But if another random variable (again with eight possibilities) has one particular outcome with probability 0.93 and the others with probability 0.01 each, then the entropy will be a lot lower, because you can get away with using less information on average.
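
The two eight-outcome examples can be checked with a short sketch. The helper name `shannon_entropy` is illustrative, and the code assumes the standard Shannon definition in base 2.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution (list of probabilities)."""
    # Outcomes with probability 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8          # eight equally likely outcomes
skewed = [0.93] + [0.01] * 7   # one dominant outcome, seven rare ones

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)

print(round(h_uniform, 3))  # 3.0
print(round(h_skewed, 3))   # 0.562
```

Note that both numbers are properties of a distribution, not of any particular observed outcome.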
 I'm curious too. Could you explain what entropy means in this context, using this example? Wikipedia defines information entropy as a property of a random variable. Which random variable are you talking about the entropy of? What is its domain, what is its range, and what is the rule that specifies its value for each input? One obvious random variable we have here is the identity function on the set $\{n \mid n \text{ is an integer},\ 0 < n < 7\}^{20}$, whose distribution gives the (equal) probability of the two events {(6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)} and {(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)}. But presumably it's not this random variable whose entropy you mean. On the other hand, Wolfram Alpha and Mathematica treat entropy as a property of a list. I'm not sure what their definition is. Are these two kinds of entropy related?

 Quote by Rasalhague I'm curious too. Could you explain what entropy does mean in this context, using this example. [...] Are these two kinds of entropy related?
Entropy is a concept that applies to any random variable, but for the sake of the examples listed by the OP, the R.V. is simply a uniform random variable (discrete in the OP's case).

Basically higher entropy means more uncertainty which means more "randomness". In terms of how to calculate that, that is well documented in the literature. For applications in physics look at thermodynamics. The idea that things get more random as time passes and the fact that entropy increases is no accident. Entropy allows us to quantify this kind of behavior.

With regards to the dice, the outcome of all 1's in terms of its entropy value is very close to zero. There is only one symbol and no real randomness in that information.

On the other hand in the second sequence of numbers there is variation in symbol type which means higher entropy.

Of the two, the second sequence has an entropy a lot closer to the theoretical entropy than the first one, so if we were to judge whether it is "more likely" that the second sequence has come from a particular process than the first, entropy is a quantitative variable that can be used to make an inference.

But you also have to remember that even though there is a big difference in entropy, both results under that process are likely. Entropy does not dictate otherwise, but it does give a quantitative measure of the randomness of some process and hence gives us an indicator of how "likely" some piece of information is to come from a particular process.

If you want a definitive definition of entropy (the one I have been talking about), get Claude Shannon's papers or the Information Theory book by Thomas Cover (latest edition as far as I know is the 2nd edition). The definition, broadly, is the sum (or integral) over the domain of -p log(p).
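
For reference, the definition described above is usually written as follows. The symbols X, p, and f are notation introduced here for illustration: p is the probability mass function of a discrete random variable X, and f is a density in the continuous case.

$H(X) = -\sum_{x} p(x) \log p(x)$

$h(X) = -\int f(x) \log f(x) \, dx$

With base-2 logarithms the unit is bits, and terms with p(x) = 0 are taken to contribute 0.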

Assuming your quote is fair (I can't see the article), Marilyn made an unfortunately common psychological mistake. The question she answered is very different from the question that was asked.
 Quote by Marilyn It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.
This statement is quite true. But can you see why this has no bearing on the following question?
 Quote by Marilyn Which of these results is more likely: (a) 11111111111111111111, or (b) 66234441536125563152?
(I think chiro is making the same mistake -- answering the question of "all 1's versus a mix of all numbers" instead of answering the question "all 1's vs that other specific sequence of numbers")

 Quote by Hurkyl (I think chiro is making the same mistake -- answering the question of "all 1's versus a mix of all numbers" instead of answering the question "all 1's vs that other specific sequence of numbers")
Can you tell me what the mistake is?

The entropy argument also fits in with your second scenario and not just the first. You base your entropy calculations on the number and the frequency of symbols that occur in your information sample. Both of these are incorporated into an entropy calculation and not just one or the other.

 Quote by chiro Can you tell me what the mistake is?
Not until you tell me why you think entropy has any relevance at all.

I can guess what your line of reasoning is, and it boils down to you answering the question
Which is more likely?
• Getting 20 1's in any order
• Getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's in any order
 Another way to describe the mistake that might be useful: The greater likelihood of getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's versus 20 1's is exactly canceled out by the unlikelihood that they happen to be arranged in exactly the order 66234441536125563152.
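
The cancellation described above can be sketched numerically. This assumes a fair die and independent rolls; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

sequence_b = "66234441536125563152"
counts = Counter(sequence_b)  # three each of 1-4, four each of 5 and 6

# Number of distinct orderings sharing these face counts
# (the multinomial coefficient 20! / (3!^4 * 4!^2)).
arrangements = factorial(len(sequence_b))
for c in counts.values():
    arrangements //= factorial(c)

p_one_sequence = Fraction(1, 6 ** 20)

# Probability of the face counts, in any order.
p_histogram_b = arrangements * p_one_sequence  # (3,3,3,3,4,4)
p_histogram_a = p_one_sequence                 # (20,0,0,0,0,0): only one ordering

# Given the face counts, every ordering is equally likely.
p_order_given_histogram = Fraction(1, arrangements)

print(arrangements)                             # 3259095840000
print(p_histogram_b / p_histogram_a)            # 3259095840000
print(p_histogram_b * p_order_given_histogram == p_one_sequence)  # True
```

The mixed histogram is about 3.26 trillion times more likely than the all-1's histogram, but the specific ordering picks out exactly one of those arrangements, so the factor cancels.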

 Quote by Hurkyl Not until you tell me why you think entropy has any relevance at all. I can guess what your line of reasoning is, and it boils down to you answering the question: Which is more likely? Getting 20 1's in any order, or getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's in any order?
Both completely depend on the distribution and its properties. If you have some distribution with independence properties like a binomial distribution then the calculations become a lot easier, but if you have something where future trial probability spaces are dependent on prior trials (or even the entire history of trials), then it becomes complicated.

Entropy, as stated before, gives us a quantitative measure of information content and uncertainty. The most uncertain source of information is a random variable whose distribution is uniform and where new trials in the stochastic process are statistically independent of past trials, in that $P(X_n = x \mid X_{n-1} = a_{n-1}, X_{n-2} = a_{n-2}, \dots, X_0 = a_0) = P(X_n = x)$, which is the strongest form of independence there is.

Now entropy requires us to supply a distribution, and based on that we get our entropy in some base. For a discrete uniform random variable, the entropy in base 2 is always $\log_2(n)$, where n is the number of possible states.

Now if we have a process with a large number of trials, strong law arguments should allow us to prove that the entropy of the information source tends to the theoretical entropy of our assumed distribution. This sort of reasoning is used in the classical frequentist approach especially with sampling distributions but we typically derive properties for things like the distribution of a statistic and not something like entropy.

Now I don't have a mathematical formula for the confidence interval for the inference for the entropy of a sample with respect to the entropy of our underlying distribution (but I'm thinking about doing the calculation myself a little later on). The argument is that if the number of trials (in this case dice rolls) is large enough, then the entropy of that sample (with respect to the distribution) should approach the theoretical entropy of the assumed distribution.

Also with regard to order, I haven't said that order affects anything. That depends entirely on the distribution. You could write an extremely complex distribution up to trial n where the distribution is a really complicated mathematical expression that is conditional on every prior trial, or you could use the independence assumption (like the binomial, multinomial).

I am simply basing my argument around the idea of creating a distribution with confidence intervals that are based on the assumption that as more information is created, the entropy of the entire information approaches the theoretical value, more or less analogously to the way that the variance of the mean statistic of a distribution approaches zero as n goes to infinity.

I haven't provided any proofs of anything and I am aware of that, but if I am wrong it would be nice to hear some justification by either a proof or simply a hole in my reasoning.

 Quote by Hurkyl Another way to describe the mistake that might be useful: The greater likelihood of getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's versus 20 1's is exactly canceled out by the unlikelihood that they happen to be arranged in exactly the order 66234441536125563152.
It depends on your assumptions. If your stochastic process has complex dependencies on prior information, then of course things that are the same in the uniform case are not the same in your complex distribution.

If you're using your standard purely random assumptions (in this case discrete uniform, any trial isn't affected by any other), then ordering won't have an effect on the entropy of your theoretical distribution.

It's not that the scenario with all 1's is impossible, it's just that it becomes more unlikely as that keeps happening. Picture a confidence interval that is measuring the sample entropy against the theoretical entropy of the assumed distribution. As the information content gets more trials (and more symbols), it is expected that the entropy tends towards the theoretical value, and the shape of the sampling distribution in the context of entropy also shrinks (i.e. variance decreases).

 Quote by chiro Both completely depend on the distribution and its properties. If you have some distribution with independence properties like a binomial distribution then the calculation become a lot easier, but if you have something where future trial probability spaces are dependent on prior trials (or even the entire history of trials), then it becomes complicated.
We're given the distribution: 20 successive rolls of a fair die. The probability is thus uniform over all 20-long sequences of the digits 1,2,3,4,5,6.

In particular, P(11111111111111111111) = P(66234441536125563152).

I'm having trouble making sense of your post. At best, it seems to be a very long-winded way of saying "Given a lot of independent trials, the observed frequencies tend towards the probability distribution the samples are drawn from".

But, as I've said before, the question is not:
Which histogram is more likely? (20,0,0,0,0,0) or (3,3,3,3,4,4)?
 This reminds me of how, in biology, my teacher emphasized that, assuming equal probability of boy or girl, even if you have 10 boys, the 11th child is equally likely to be either gender. What makes both the biology example and the die example true is that, if the preceding events don't affect the probability of subsequent events, then the preceding outcomes are just as probable to happen again as they were. Concretely, the fact that what you roll has no effect on subsequent rolls means that rolling a 1 doesn't make rolling a 1 on your next roll any more or less probable.

 Quote by Hurkyl We're given the distribution: 20 successive rolls of a fair die. [...] But, as I've said before, the question is not: Which histogram is more likely? (20,0,0,0,0,0) or (3,3,3,3,4,4)?
The entropy of having all 1's is zero. The entropy of your right hand side example is a lot higher (I'm just using the standard definition of entropy).

The reason why the first is zero is because P(X = 1) = 1 and log(1) = 0.
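
The numbers being described here can be reproduced with a short sketch. It computes the entropy of the observed symbol frequencies (often called the plug-in or empirical entropy); the function name is illustrative, and this is one reading of "the entropy of a sequence", not the only one.

```python
import math
from collections import Counter

def empirical_entropy(sequence):
    """Entropy in bits of the observed symbol frequencies in a sequence."""
    n = len(sequence)
    # p * log2(1/p) summed over observed symbols, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(sequence).values())

seq_a = "1" * 20
seq_b = "66234441536125563152"

h_a = empirical_entropy(seq_a)
h_b = empirical_entropy(seq_b)
h_fair = math.log2(6)  # entropy of a single fair die roll

print(round(h_a, 3))     # 0.0
print(round(h_b, 3))     # 2.571
print(round(h_fair, 3))  # 2.585

# The value depends only on the face counts, not on the order of the rolls.
print(empirical_entropy(sorted(seq_b)) == h_b)  # True
```

Because this quantity depends only on the histogram, it is the same for every rearrangement of the second sequence, so it cannot distinguish that specific ordering from any other with the same counts.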

Maybe I'm arguing something that I didn't intend to argue, so I'll be clear.

What I'm arguing is that as you get more trials from your process, the entropy of your information sample approaches that of the underlying distribution. Probabilistically, getting 111111111 and 635162122 are equally likely and I'm not disagreeing with that. What I'm saying is that I have more "confidence" with the second one coming from our stochastic process than the first.

If I made an error in communication, then I hope it is cleared up now. The probability is the same, but the level of confidence is not, and that is why Vos Savant said that she preferred the second output vs the first.

 Quote by chiro What I'm saying is that I have more "confidence" with the second one coming from our stochastic process than the first.
What is "confidence"? Is it anything other than "I know the math says one thing, but I don't want to believe it"? (edit: I don't mean to be condescending, but it is really easy to try and rationalize one's intuition when faced with the fact it's simply wrong)

The mistake I mentioned earlier -- here is one way to make that mistake:
I'm going to invent a statistical test: my statistic T is the entropy of the distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623.... Therefore, I infer that 6623... is what was rolled.

 Quote by chiro The entropy of having all 1's is zero. The entropy of your right hand side example is a lot higher (I'm just using the standard definition of entropy).
The standard definition uses only a probability distribution, not samples/outcomes.
Indeed, the entropy of the empirical distribution tends to the entropy of the underlying distribution, as N (number of throws) tends to infinity. But in our case N=20 is fixed.
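
The convergence mentioned above can be illustrated with a small simulation. This is only a sketch: the seed, the sample sizes, and the helper names are arbitrary choices made here, and the die is simulated with Python's pseudo-random generator.

```python
import math
import random
from collections import Counter

def empirical_entropy(rolls):
    """Entropy in bits of the observed face frequencies."""
    n = len(rolls)
    return sum((c / n) * math.log2(n / c) for c in Counter(rolls).values())

def roll_fair_die(n, rng):
    """Simulate n independent rolls of a fair six-sided die."""
    return [rng.randint(1, 6) for _ in range(n)]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
target = math.log2(6)   # entropy of the underlying distribution

# Gap between the theoretical entropy and the empirical entropy, per sample size.
gaps = {}
for n in (20, 200, 2000, 200000):
    gaps[n] = target - empirical_entropy(roll_fair_die(n, rng))
    print(n, round(gaps[n], 4))
```

The gap shrinks for large N, but at N = 20 it is a property of whichever histogram happened to come up and says nothing about the probability of a specific ordering.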

 Quote by chiro The probability is the same, but the level of confidence is not
What do you mean by that?
Having a concrete sample - a sequence of 20 throws - does indeed allow you to change your confidence about some assumption, e.g. the hypothesis that the die is fair, but this is a whole different problem.
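
To make that "different problem" concrete, here is a hedged sketch of a Bayesian update on the fairness hypothesis. Everything about the alternative is an assumption introduced for illustration: a hypothetical trick die that always shows 1, and a made-up prior of one in a million that such a die is in use.

```python
from fractions import Fraction

def likelihood_fair(seq):
    """P(seq | fair die): every specific sequence has probability (1/6)^len."""
    return Fraction(1, 6) ** len(seq)

def likelihood_trick(seq):
    """P(seq | hypothetical trick die that always shows 1)."""
    return Fraction(1) if set(seq) == {"1"} else Fraction(0)

def posterior_fair(seq, prior_fair):
    """Posterior probability that the die is fair, by Bayes' rule."""
    fair_term = prior_fair * likelihood_fair(seq)
    trick_term = (1 - prior_fair) * likelihood_trick(seq)
    return fair_term / (fair_term + trick_term)

seq_a = "1" * 20
seq_b = "66234441536125563152"
prior_fair = Fraction(999_999, 1_000_000)  # illustrative prior, not from the thread

# Under the fair-die hypothesis the two sequences are equally likely.
print(likelihood_fair(seq_a) == likelihood_fair(seq_b))  # True

# The observed sequence still changes how plausible "the die is fair" is.
print(posterior_fair(seq_b, prior_fair))         # 1
print(float(posterior_fair(seq_a, prior_fair)))  # tiny, on the order of 1e-10
```

This addresses how much to trust the fairness assumption after seeing a sample, which is a separate question from which of two specific sequences a fair die is more likely to produce.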
