Is Marilyn Vos Savant wrong on this probability question?

CantorSet · Aug 10, 2011

Hi everyone,

Someone posed the question:

Say you plan to roll a die 20 times. Which of these results is more likely: (a) 11111111111111111111, or (b) 66234441536125563152?

and Marilyn (high IQ record holder) answers:

In theory, the results are equally likely. Both specify the number that must appear each time the die is rolled. (For example, the 10th number in the first series must be a 1. The 10th number in the second series must be a 3.) Each number—1 through 6—has the same chance of landing faceup.

But let’s say you tossed a die out of my view and then said that the results were one of the above. Which series is more likely to be the one you threw? Because the roll has already occurred, the answer is (b). It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.

What do you guys think? You can find the original in the link below.

http://www.parade.com/askmarilyn/2011/07/Sundays-Column-07-31-11.html

chiro · Aug 10, 2011

CantorSet said:

Hi everyone,

Someone posed the question:
and Marilyn (high IQ record holder) answers:
What do you guys think? You can find the original in the link below.

http://www.parade.com/askmarilyn/2011/07/Sundays-Column-07-31-11.html

She is right in saying that both can happen. Any permutation of the result is possible (and hence probable).

One intuitive idea that you can use to analyze this situation is entropy. In a system that is purely random (think uniform probability of any possibility occurring), then its more likely that you get a result which reflects this high entropy than something with a lower entropy.

In the case of the first result with all 1's: this is something which has extremely low entropy. The fact that we have only one choice across all trials suggest that if this result is representative of the underlying process, then the process itself has a low measure of randomness (again represented by entropy).

The fact that we are assuming that the process of dice rolling is purely random suggests that we would intuitively reject the result with near zero entropy and instead favor the result with entropy closer to the theoretical entropy of the process, which is the second result.

Like she said, it doesn't mean that the sequence of 1's isn't possible. It just means that in the context of entropy, it is more likely that for a system that has a high amount of entropy, that you would expect a result that is somewhat related to the theoretical entropy of the system based on the fact that the assumptions used to calculate the theoretical entropy are good enough.

We use this idea in physics to explain things like why plates don't just un-break themselves spontaneously in the real world. When we observe physical systems we find that things do get more "random" and as such we have the expectation of seeing things that exhibit this property.

So if you want to think about how you would judge whether you would believe one outcome over another in respect to a given system, think about the results in terms of the entropy of the given system as a good heuristic to gauge some level of likelihood.

CantorSet · Aug 10, 2011

chiro said:

She is right in saying that both can happen. Any permutation of the result is possible (and hence probable).

One intuitive idea that you can use to analyze this situation is entropy. In a system that isn't purely random (think uniform probability of any possibility occurring), then its more likely that you get a result which reflects this high entropy than something with a lower entropy.

In the case of the first result with all 1's: this is something which has extremely low entropy. The fact that we have only one choice across all trials suggest that if this result is representative of the underlying process, then the process itself has a low measure of randomness (again represented by entropy).

The fact that we are assuming that the process of dice rolling is purely random suggests that we would intuitively reject the result with near zero entropy and instead favor the result with entropy closer to the theoretical entropy of the process, which is the second result.

Like she said, it doesn't mean that the sequence of 1's isn't possible. It just means that in the context of entropy, it is more likely that for a system that has a high amount of entropy, that you would expect a result that is somewhat related to the theoretical entropy of the system based on the fact that the assumptions used to calculate the theoretical entropy are good enough.

We use this idea in physics to explain things like why plates don't just un-break themselves spontaneously in the real world. When we observe physical systems we find that things do get more "random" and as such we have the expectation of seeing things that exhibit this property.

So if you want to think about how you would judge whether you would believe one outcome over another in respect to a given system, think about the results in terms of the entropy of the given system as a good heuristic to gauge some level of likelihood.

Thanks for the response, Chiro.

I'm unclear what entropy means in this context but it sounds interesting. This problem is unsettling because if we let A be the event with the sequence of all 1's, B be the event of the sequence with mixed numbers and let C = A U B. Then, we have P(A) = P(B) and therefore,

[itex]P(A|C) = \frac{P(A)}{P(A)+P(B)} = \frac{P(B)}{P(A)+P(B)} = P(B|C)[/itex]

So it seems they should have the same conditional probability. But maybe I'm not setting up these equations correctly. Also, how would we factor in the entropy you're speaking of into these equations?

chiro · Aug 10, 2011

CantorSet said:

Thanks for the response, Chiro.

I'm unclear what entropy means in this context but it sounds interesting. This problem is unsettling because if we let A be the event with the sequence of all 1's, B be the event of the sequence with mixed numbers and let C = A U B. Then, we have P(A) = P(B) and therefore,

[itex]P(A|C) = \frac{P(A)}{P(A)+P(B)} = \frac{P(B)}{P(A)+P(B)} = P(B|C)[/itex]

So it seems they should have the same conditional probability. But maybe I'm not setting up these equations correctly. Also, how would we factor in the entropy you're speaking of into these equations?

Entropy is a way of measuring information content. If something is completely random it has a high entropy because all things are equally likely and therefore if you have many things that are equally likely you need to allocate more information to something. For example if there are eight possibilities in a random variable and they are equally likely the entropy in base 2 is 3. But if for example you have another random variable (again with eight possibilities) and probability of one particular event is 0.93 with the others having probability of 0.01, then the entropy will be a lot lower because you will be able to get away with using less information.

Rasalhague · Aug 11, 2011

I'm curious too. Could you explain what entropy does mean in this context, using this example. Wikipedia defines information entropy as a property of a random variable. Which random variable are you talking about the entropy of? What is its domain, what is its range, what is the rule that specifies its value for each input?

One obvious random variable we have here is the identity function on the set {n | n is an integer, 0 < n < 7}²⁰, whose distribution gives the (equal) probability of the two events {(6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)} and {(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)}. But presumably it's not this random variable whose entropy you mean.

On the other hand, Wolfram Alpha and Mathematica treat entropy as a property of a list. I'm not sure what their definition is. Are these two kinds of entropy related?

chiro · Aug 11, 2011

Rasalhague said:

I'm curious too. Could you explain what entropy does mean in this context, using this example. Wikipedia defines information entropy as a property of a random variable. Which random variable are you talking about the entropy of? What is its domain, what is its range, what is the rule that specifies its value for each input?

One obvious random variable we have here is the identity function on the set {n | n is an integer, 0 < n < 7}²⁰, whose distribution gives the (equal) probability of the two events {(6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)} and {(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)}. But presumably it's not this random variable whose entropy you mean.

On the other hand, Wolfram Alpha and Mathematica treat entropy as a property of a list. I'm not sure what their definition is. Are these two kinds of entropy related?

Entropy is a concept invariant to any random variable, but for the sake of the examples listed by the OP, the R.V. is simply a uniform random variable (discrete in the OP's case).

Basically higher entropy means more uncertainty which means more "randomness". In terms of how to calculate that, that is well documented in the literature. For applications in physics look at thermodynamics. The idea that things get more random as time passes and the fact that entropy increases is no accident. Entropy allows us to quantify this kind of behavior.

With regards to the dice, the outcome of all 1's in terms of its entropy value is very close to zero. There is only one symbol and no real randomness in that information.

On the other hand in the second sequence of numbers there is variation in symbol type which means higher entropy.

Of both of these, the second sequence has an entropy a lot closer to the theoretical entropy than the first one, so if we were to judge whether it is "more likely" whether the second sequence is come from a particular process than the other, entropy is a quantitative variable that can be used to make an inference.

But you also have to remember that even though there is a big difference in entropy, both results under that process are likely. Entropy does not dictate other-wise, but it does give a quantitative measure of the randomness of some process and hence gives us an indicator of how "likely" some piece of information is to come from a particular process.

If you want a definitive definition of entropy (the one I have been talking about), get Claude Shannon's papers or the Information Theory book by Thomas Cover (latest edition as far as I know is the 2nd edition). The definition broadly is basically sum (or integral) over the domain of - p x log(p).

Hurkyl · Aug 11, 2011

Assuming your quote is fair (I can't see the article), Marilyn made an unfortunately common psychological mistake. The question she answered is very different from the question that was asked.

Marilyn said:

It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.

This statement is quite true. But can see why this has no bearing on the following question?

Marilyn said:

Which of these results is more likely: (a) 11111111111111111111, or (b) 66234441536125563152?

(I think chiro is making the same mistake -- answering the question of "all 1's versus a mix of all numbers" instead of answering the question "all 1's vs that other specific sequence of numbers")

chiro · Aug 11, 2011

Hurkyl said:

(I think chiro is making the same mistake -- answering the question of "all 1's versus a mix of all numbers" instead of answering the question "all 1's vs that other specific sequence of numbers")

Can you tell me what the mistake is?

The entropy argument also fits in with your second scenario and not just the first. You base your entropy calculations on the number and the frequency of symbols that occur in your information sample. Both of these are incorporated into an entropy calculation and not just one or the other.

Hurkyl · Aug 11, 2011

chiro said:

Can you tell me what the mistake is?

Not until you tell me why you think entropy has any relevance at all.

I can guess what your line of reasoning is, and it boils down to you answering the question

Which is more likely?

Getting 20 1's in any order
Getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's in any order

Hurkyl · Aug 11, 2011

Another way to describe the mistake that might be useful:

The greater likelihood of getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's versus 20 1's is exactly canceled out by the unlikelyhood they happen to be arranged in exactly the order 66234441536125563152.

chiro · Aug 11, 2011

Hurkyl said:

Not until you tell me why you think entropy has any relevance at all.

I can guess what your line of reasoning is, and it boils down to you answering the question

Which is more likely?

Getting 20 1's in any order

Getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's in any order

Both completely depend on the distribution and its properties. If you have some distribution with independence properties like a binomial distribution then the calculation become a lot easier, but if you have something where future trial probability spaces are dependent on prior trials (or even the entire history of trials), then it becomes complicated.

Entropy as stated before gives us a quantitative measure of information content and uncertainty. The most uncertain source of information comes from a random variable where the distribution is uniform, and where new trials in the stochastic process are statistically independent from past trials in that given P(Xn = x | Xn-1 = a(n-1), Xn-2 = a(n-2), ..., X0 = a(0)) = P(Xn = x) which is the strongest form of independence there is.

Now entropy requires us to supply a distribution and based on that we get our entropy in some base. In the uniform random variable (discrete), the entropy is always log2(n) where n is the number of possible states.

Now if we have a process with a large number of trials, strong law arguments should allow us to prove that the entropy of the information source tends to the theoretical entropy of our assumed distribution. This sort of reasoning is used in the classical frequentist approach especially with sampling distributions but we typically derive properties for things like the distribution of a statistic and not something like entropy.

Now I don't have a mathematical formula for the confidence interval for the inference for the entropy of a sample with respect to the entropy of our underlying distribution (but I'm thinking about doing the calculation myself a little later on), but based on the argument that if the number of trials (in this case dice rolls) is large enough, then the entropy of that sample (with respect to the distribution) should approach the theoretical entropy of the assumed distribution.

Also with regard to order, I haven't said that order affects anything. That depends entirely on the distribution. You could write an extremely complex distribution up to trial n where the distribution is a really complicated mathematical expression that is conditional on every prior trial, or you could use the independence assumption (like the binomial, multinomial).

I am simply basing my argument around the idea of creating a distribution with confidence intervals that are based on the assumption that as more information is created, the entropy of the entire information approaches the theoretical value more or less analogously in the same way that the variance of the the mean statistic of a distribution approaches zero and n goes to infinity.

I haven't provided any proofs of anything and I am aware of that, but if I am wrong it would be nice to hear some justification by either a proof or simply a hole in my reasoning.

chiro · Aug 11, 2011

Hurkyl said:

Another way to describe the mistake that might be useful:

The greater likelihood of getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's versus 20 1's is exactly canceled out by the unlikelyhood they happen to be arranged in exactly the order 66234441536125563152.

It depends on your assumptions. If your stochastic process has complex dependencies on prior information, then of course things that are the same in the uniform case are not the same in your complex distribution.

If your using your standard purely random assumptions (in this case discrete uniform, any trial isn't affected by any other), then ordering won't have an affect on the entropy of your theoretic distribution.

It's not that the scenario with all 1's is impossible, it's just that it becomes more unlikely as that keeps happening. Picture a confidence interval that is measuring the sample entropy against the theoretical entropy of the assumed distribution. As the information content gets more trials (and more symbols), it is expected that the entropy tends towards the theoretical value, and the shape of the sampling distribution in the context of entropy also shrinks (i.e. variance decreases).

Hurkyl · Aug 11, 2011

Both completely depend on the distribution and its properties. If you have some distribution with independence properties like a binomial distribution then the calculation become a lot easier, but if you have something where future trial probability spaces are dependent on prior trials (or even the entire history of trials), then it becomes complicated.

We're given the distribution: 20 successive rolls of a fair die. The probability is thus uniform over all 20-long sequences of the digits 1,2,3,4,5,6.

In particular, P(11111111111111111111) = P(66234441536125563152).

I'm having trouble making sense of your post. At best, it seems to be a very long winded way of saying "Given a lot of independent trials, the observed frequencies tend towards the probability distribution the samples are drawn from".

But, as I've said before, the question is not:

Which histogram is more likely? (20,0,0,0,0,0) or (3,3,3,3,4,4)?

TylerH · Aug 11, 2011

This reminds me of how, in biology, my teacher emphasized, assuming equal probability of boy or girl, that even if you have 10 boys, there is equal probability between the possible genders of the 11th.

The fact that causes the biology example, and the die example to be true is that, if the preceding events don't affect the probability of subsequent events, then the preceding events are just as probable to happen again as the were. Concretely, the fact that what you roll has no effect on subsequent rolls means that rolling a 1 doesn't make rolling a 1 on your next roll any more or less probable.

chiro · Aug 11, 2011

Hurkyl said:

We're given the distribution: 20 successive rolls of a fair die. The probability is thus uniform over all 20-long sequences of the digits 1,2,3,4,5,6.

In particular, P(11111111111111111111) = P(66234441536125563152).

I'm having trouble making sense of your post. At best, it seems to be a very long winded way of saying "Given a lot of independent trials, the observed frequencies tend towards the probability distribution the samples are drawn from".

But, as I've said before, the question is not:

Which histogram is more likely? (20,0,0,0,0,0) or (3,3,3,3,4,4)?

The entropy of having all 1's is zero. The entropy of your right hand side example is a lot higher (I'm just using the standard definition of entropy).

The reason why the first is zero is because P(X = 1) = 1 and log(1) = 0.

Maybe I'm arguing something that I didn't intend to argue, so I'll be clear.

What I'm arguing is that as you get more trials from your process, the entropy of your information sample approaches that the underlying distribution. Probabilistically, getting 111111111 and 635162122 are equally likely and I'm not disagreeing with that. What I'm saying is that I have more "confidence" with the second one coming from our stochastic process than the first.

If I made an error in communication, then I hope it is cleared up now. The probability is the same, but the level of confidence is not, and that is why Vos Savant said that she preferred the second output vs the first.

Hurkyl · Aug 11, 2011

chiro said:

What I'm saying is that I have more "confidence" with the second one coming from our stochastic process than the first.

What is "confidence"? Is it anything other than "I know the math says one thing, but I don't want to believe it"? (edit: I don't mean to be condescending, but it is really easy to try and rationalize one's intuition when faced with the fact it's simply wrong)

The mistake I mentioned earlier -- here is one way to make that mistake:

I'm going to invent a statistical test: my statistic T is the entropy distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623... Therefore, I infer that 6623... is what was rolled

vladb · Aug 11, 2011

chiro said:

The entropy of having all 1's is zero. The entropy of your right hand side example is a lot higher (I'm just using the standard definition of entropy).

The standard definition uses only probability distribution, not samples/outcomes.
Indeed, the entropy of the empirical distribution tends to the entropy of the underlying distribution, as N (number of throws) tends to infinity. But in our case N=20 is fixed.

chiro said:

The probability is the same, but the level of confidence is not

What do you mean by that?
Having a concrete sample - a sequence of 20 throws - does indeed allow you to change your confidence about some assumption, e.g. the hypothesis that the dice is fair, but this is a whole different problem.

chiro · Aug 11, 2011

Hurkyl said:

What is "confidence"? Is it anything other than "I know the math says one thing, but I don't want to believe it"? (edit: I don't mean to be condescending, but it is really easy to try and rationalize one's intuition when faced with the fact it's simply wrong)

The mistake I mentioned earlier -- here is one way to make that mistake:

I'm going to invent a statistical test: my statistic T is the entropy distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623... Therefore, I infer that 6623... is what was rolled

By confidence I mean that under a given assumption (in this case the number of trials, the theoretical assumed distribution) and based on the sample (our actual generated information), there is a distribution that formalizes our idea of inference for something to be x% probable.

As a concrete example, let's focus on a statistical parameter: the mean of a distribution. Now we have sample information which is obviously finite. We use the unbiased estimator to determine a good estimate of our mean.

Now we have assume that the underlying distribution of each of the samples (remember they are from the same distribution) is some distribution, let's say normal. So we use statistical theory to calculate the distribution of the normalized difference of the statistic and the parameter (so [x_bar - mu]/root(Var)) and this is normal because the distribution is a linear combination of normal distributions.

So based on this we construct a confidence interval given our assumptions and we find that we have a 95% interval that says that 95% of the time, the true difference between the statistic and the parameter would lie in this interval.

As we increase our sample size we expect the variance of this distribution to go to zero as more samples are added and this is intuitive since our estimate for our true mean (the parameter we are estimating) gets more accurate. This is basically a result of the strong law of large numbers and this kind of thinking is used to generate many classical statistics theory.

With the above in mind, I am proposing the same thing but instead of the mean parameter, I am using the parameter of entropy.

Like with the mean example, you create a distribution that let's you do an inference calculation based on the entropy statistic belonging to a distribution with a theoretical entropy (the parameter in this case). So using the analogue with the mean you get a sampling distribution that instead of being focused on measuring the mean, you instead are focuses on testing the inference that the sample corresponds to having an entropy (the statistics) corresponding to the theoretical entropy (the parameter).

You construct your sampling distribution, and based on that generate a confidence interval for some level of confidence. You base this on (1) the theoretical entropy (from your assumed distribution), (2) your statistic entropy (measured from your sample), and (3) the number of samples. You may actually need more than that depending on the distribution yourself, but let's stick to the basics.

Again like with your mean, your construct an interval based on your level of confidence, and then depending on the result accept or reject the null hypothesis.

Now with the uniform distribution (discrete) the entropy is always maximized, as n increases you would expect the distribution to tighten around the maximum. It is done much in the same way that the distribution of the mean tightens around a specific value.

So in finishing, confidence is basically using probability models to make inference on how "confident" you are on a sample meeting the criteria of some theoretical parameter, which in this case is entropy.

chiro · Aug 11, 2011

vladb said:

What do you mean by that?

See my above post.

vladb · Aug 11, 2011

So you have described a statistic (not necessarily a good one) to test some assumption/hypothesis.

Given some assumption/hypothesis about your stochastic process, your statistic T = "entropy of the empirical distribution" has some certain distribution. If in reality the realization of T takes a value that is very uncharacteristic of your assumed distribution (like 0 in case of 111...), we can, with certain confidence, conclude that the assumption/hypothesis about the the stochastic process was wrong, but this tells us nothing about the rolls (as in OPs question).

Do you see the difference between (what Hurkyl tried to point out):

I'm going to invent a statistical test: my statistic T is the entropy distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623... Therefore, I infer that 6623... is what was rolled

and

I'm going to invent a statistical test: my statistic T is the entropy distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623... Therefore, I infer the dice is not fair

?

chiro · Aug 11, 2011

Let us again look at the quote of the OP:

In theory, the results are equally likely. Both specify the number that must appear each time the die is rolled. (For example, the 10th number in the first series must be a 1. The 10th number in the second series must be a 3.) Each number—1 through 6—has the same chance of landing faceup.

But let’s say you tossed a die out of my view and then said that the results were one of the above. Which series is more likely to be the one you threw? Because the roll has already occurred, the answer is (b). It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.

Now in particular focus on this statement:

It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.

This is what I'm focusing on. I am talking about using a statistic and a sampling distribution to measure confidence.

Like I said in a previous post I agree that both possibilities are equally likely, but in terms of a hypothesis test, our measure of confidence will have preference towards some values against others, depending on things like how many random samples are used to compute the statistic.

I'm not talking about ordering

vladb said:

So you have described a statistic (not necessarily a good one) to test some assumption/hypothesis.

Why is entropy a bad statistic? It gives a quantifiable indicator of uncertainty and information content. In the context of the OP's question, it allows one to to give a reason why we might reject that a long term large information sample may not be from a specific distribution.

The OP asked about the quote of why all 1's are less likely and this statistic helps quantify that.

Given some assumption/hypothesis about your stochastic process, your statistic T = "entropy of the empirical distribution" has some certain distribution. If in reality the realization of T takes a value that is very uncharacteristic of your assumed distribution (like 0 in case of 111...), we can, with certain confidence, conclude that the assumption/hypothesis about the the stochastic process was wrong, but this tells us nothing about the rolls (as in OPs question).

With regards to confidence, yes that is correct. Note that if you have 100% confidence then your interval is from [0,log_2(N)] where N is the number of states that you can have (in this case 6), and this includes the case with all 1's.

Do you see the difference between (what Hurkyl tried to point out):

I'm going to invent a statistical test: my statistic T is the entropy distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623... Therefore, I infer that 6623... is what was rolled
and

I'm going to invent a statistical test: my statistic T is the entropy distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623... Therefore, I infer the dice is not fair
?

The second case is basically what I am describing. I am not saying that the 6623 was rolled (I don't know how you thought that), I am saying that with x% confidence, that I infer that the dice roll of 1111... means that it is not likely that our true process has the given entropy of log2(6) and that I infer that our assumed distribution is based on different assumptions.

Confidence intervals don't tell us what should happen: they tell us if under certain assumptions, how probable something is to happen with respect to some model.

Again if you have an issue with this, please explain why. Using a statistic and a sampling distribution is at the heart at statistical inference. Entropy quantifies uncertainty, information content, and information compressibility, and when the sample entropy is largely different from some expected theoretical entropy, then naturally you would ask the question "Are my assumptions right?". It's no different in making inferences about any parameter.

chiro · Aug 11, 2011

vladb said:

The standard definition uses only probability distribution, not samples/outcomes.
Indeed, the entropy of the empirical distribution tends to the entropy of the underlying distribution, as N (number of throws) tends to infinity. But in our case N=20 is fixed.

Yes you are right, but this behavior is mimicked for every statistic and sampling distribution. It's going to be out of whack for entropy than say for the sampling distribution of something like the mean, but the distribution for some value of N will reflect that.

What do you mean by that?
Having a concrete sample - a sequence of 20 throws - does indeed allow you to change your confidence about some assumption, e.g. the hypothesis that the dice is fair, but this is a whole different problem.

Maybe you understood me wrong, or maybe I just communicated the wrong message, but when I talk about confidence I mean exactly that. The inference you are making in this case is whether you think that the sample is representative of a distribution with maximal entropy, in exactly the same way as testing the inference that the results support a fair or unbalanced die.

chiro · Aug 11, 2011

Hurkyl said:

Another way to describe the mistake that might be useful:

The greater likelihood of getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's versus 20 1's is exactly canceled out by the unlikelyhood they happen to be arranged in exactly the order 66234441536125563152.

In this example conditional probabilities don't come into this and so order is irrelevant. If order had some impact, there would be a slew of conditional properties and the distribution for some number of rolls would reflect that.

chiro · Aug 11, 2011

Hurkyl said:

We're given the distribution: 20 successive rolls of a fair die. The probability is thus uniform over all 20-long sequences of the digits 1,2,3,4,5,6.

In particular, P(11111111111111111111) = P(66234441536125563152).

I agree with that statement.

I'm having trouble making sense of your post. At best, it seems to be a very long winded way of saying "Given a lot of independent trials, the observed frequencies tend towards the probability distribution the samples are drawn from".

But, as I've said before, the question is not:

Which histogram is more likely? (20,0,0,0,0,0) or (3,3,3,3,4,4)?

That is what I interpreted the OP's last question (or part of the question) to be in which case means we are not talking about the same thing.

pwsnafu · Aug 11, 2011

chiro said:

That is what I interpreted the OP's last question (or part of the question) to be in which case means we are not talking about the same thing.

I interpreted as: the reader specified a specific ordering. If it was an unordered question it would be stated as

Which of these is more likely: 20 ones; or 3 ones 3 twos 3 threes...

and so on. The fact that the reader didn't write the question in that format means parsing the question as ordered enumeration is the more likely choice.

chiro · Aug 12, 2011

Also I want to point out that for this particular example, the entropy is maximized with regard to the number of states since it is a pure random process.

For other distributions where this is not the case, it's probably not going to be as good an indicator, but for this case we would expect as n grows that if the real theoretical distribution was pure random, then the sampling distribution should be highly centered around the maximum and converge to that value.

One way to check if the sample models a specific distribution is to do it by moments, but depending on the distribution itself you may need to check quite a few of these. For the random process, entropy does a good job in characterizing uniform random distributions which is why I chose it.

vladb · Aug 12, 2011

It does seem a miscommunication problem. I was reading the original question as: given that the event {111... or 6623...} has occurred (where 6623.. was agreed beforehad), which one of the two is then more likely to have occured? Saying one is preferable (in terms of probability) to the other is wrong, but I now see it's not what you claimed.

chiro said:

Again if you have an issue with this, please explain why. Using a statistic and a sampling distribution is at the heart at statistical inference. Entropy quantifies uncertainty, information content, and information compressibility, and when the sample entropy is largely different from some expected theoretical entropy, then naturally you would ask the question "Are my assumptions right?". It's no different in making inferences about any parameter.

I am familiar with mathematical statistics and information theory:).
You can indeed infer something about your model/parameters/assumptions, but this is different from "inferring" something about what was actually rolled (in terms of conditional probabilities).

As for entropy of empirical distribution as a statistic - I didn't say it was bad, but it does have limited applicability. If you have non-uniform distributions, it is likely not be able to differentiate between hypotheses that are equally feasible (especially in physics experiments). In case of uniform distribution (as an assumption), this is not a problem, but then there are a number of other statistics, and I'm not sure at the moment, whether the entropy statistic would have better power.

chiro · Aug 12, 2011

Yeah to be honest I chose entropy on an intuitive hunch based on the fact that it is maximal under purely random processes.

I might look into deriving some sampling distributions based on the sampling statistic of entropy. Sounds like a nice exercise.

Hurkyl · Aug 12, 2011

chiro said:

Now in particular focus on this statement:

It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.

This is what I'm focusing on.

Then you have completely missed the point of the opening post, which is that this part you're focusing on isn't an answer to the question she claimed she was answering.

Also, you are making things far too complicated; e.g. to answer the question

Say you plan to roll a die 20 times. Which of these results is more likely:
a) 20 1's, or b) 3 1's 3 2's 3 3's 3 4's 4 5's and 4 6's?

all you have to note is that there are more sequences that fit the second pattern than the first pattern, so the odds are 3259095840000:1 in favor of b.

(aside: I can't find any trace of her article on the internet -- just a couple of forums that have quoted parts of it. e.g. the link in the opening post just redirects me to Parade's front webpage )

phinds · Aug 12, 2011

Hurkyl, I admire your persistence in try to explain this.

The statement presented by the author quoted in the OP is unambiguous and clearly correct, but the brief manner of presenting an incomplete statement in the second observation clearly has caused a lot of confusion.

I believe the following statement is a way to say what the original author intended to say but which avoids the confusion and likely would not have even lead to this thread having been started in the first place.

"If you roll n dice the likelihood of your getting any SPECIFIC sequence is identical to the likelihood of your getting any OTHER specific sequence, but if I don't see you roll the dice and you tell me you got a pretty random jumble of numbers, I'm going to find that easy to believe because it's quite likely that's what will happen, but if you tell me you got something like all ones, I'm going to find that pretty odd because it's unlikely. On the other hand, if I gave you a random sequence of numbers and asked you to roll that SPECIFIC sequence, I would find your getting it to be exactly as unlikely as your getting all ones."

SW VandeCarr · Aug 12, 2011

phinds said:

Hurkyl, I admire your persistence in try to explain this... if I gave you a random sequence of numbers and asked you to roll that SPECIFIC sequence, I would find your getting it to be exactly as unlikely as your getting all ones."

Exactly. Why did it take 30 posts to explain this? The probability is [itex] 10^{-20}[/itex] for either sequence assuming independence and every digit is equally probable.

phinds · Aug 12, 2011

SW VandeCarr said:

Exactly. Why did it take 30 posts to explain this?.

Don't know, but that's why I admire Hurkyl's persistence.

OpenGates · Aug 12, 2011

Phinds got it right. This is apples and oranges. Probability is the apples and believability is the oranges.

The probability that you roll all ones is the same as the probability that you roll any other sequence.

But the believability that you roll all ones is lower than the believability that you roll a less aesthetic sequence.

DaveC426913 · Aug 12, 2011

OpenGates said:

Phinds got it right. This is apples and oranges. Probability is the apples and believability is the oranges.

The probability that you roll all ones is the same as the probability that you roll any other sequence.

But the believability that you roll all ones is lower than the believability that you roll a less aesthetic sequence.

I had a tough time convincing my brother that the lottery numbers 1,2,3,4,5,6,7 had exactly the same odds as any other selection of 7 specific numbers.

OpenGates said:

But the believability that you roll all ones is lower than the believability that you roll a less aesthetic sequence.

This is simply explained: in the bucket called "rolls of all 1's" there is exactly one result. In the bucket labelled "a less aesthetic sequence" there are many, many, many results. That's many, many, many possibilities that can match it.

SW VandeCarr · Sep 11, 2011

DaveC426913 said:

This is simply explained: in the bucket called "rolls of all 1's" there is exactly one result. In the bucket labelled "a less aesthetic sequence" there are many, many, many results. That's many, many, many possibilities that can match it.

I just wanted to correct any misconceptions that might arise from the above. There are not many possible matching results. We are talking about the probability of one unique sequence regardless of what it appears to be. For example, pi has just one correct sequence. There is no apparent pattern to the digit sequence, but the sequence is unique.

Is Marilyn Vos Savant wrong on this probability question?

Similar threads

Hot Threads

Recent Insights