Is Marilyn Vos Savant wrong on this probability question? 
#1
Aug1011, 10:11 PM

P: 45

Hi everyone,
Someone posed the question: http://www.parade.com/askmarilyn/201...073111.html 


#2
Aug1011, 10:54 PM

P: 4,572

One intuitive idea you can use to analyze this situation is entropy. In a system that is purely random (think of a uniform probability over all possibilities), it's more likely that you get a result that reflects this high entropy than one with lower entropy.
Take the first result, all 1's: it has extremely low entropy. The fact that only one value appears across all trials suggests that, if this result is representative of the underlying process, then the process itself has a low measure of randomness (again represented by entropy). Since we are assuming that dice rolling is purely random, we would intuitively reject the result with near-zero entropy and instead favor the result whose entropy is closer to the theoretical entropy of the process, which is the second result.
Like she said, this doesn't mean that the sequence of 1's isn't possible. It just means that, for a system with a high amount of entropy, you would expect a result whose entropy is close to the theoretical entropy of the system, provided the assumptions used to calculate that theoretical entropy are good enough.
We use this idea in physics to explain things like why plates don't spontaneously unbreak themselves in the real world. When we observe physical systems we find that things do get more "random", and so we expect to see things that exhibit this property.
So if you want to judge whether to believe one outcome over another for a given system, thinking about the results in terms of the entropy of that system is a good heuristic for gauging some level of likelihood.


#3
Aug1011, 11:47 PM

P: 45

I'm unclear what entropy means in this context, but it sounds interesting. This problem is unsettling for the following reason. Let A be the event that the sequence is all 1's, let B be the event that it is the sequence of mixed numbers, and let C = A ∪ B. Since A and B are disjoint and P(A) = P(B), we have [itex]P(A \mid C) = \frac{P(A)}{P(A)+P(B)} = \frac{P(B)}{P(A)+P(B)} = P(B \mid C)[/itex]. So it seems they should have the same conditional probability. But maybe I'm not setting up these equations correctly. Also, how would we factor the entropy you're speaking of into these equations?
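The equality above can be checked with exact arithmetic. This is only a sketch, assuming a fair six-sided die, independent rolls, and the 20-roll mixed sequence quoted elsewhere in this thread; the function and variable names are made up for illustration.

```python
from fractions import Fraction

# The two 20-roll sequences discussed in the thread.
all_ones = (1,) * 20
mixed = (6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)

def sequence_probability(seq, sides=6):
    """Exact probability of one specific ordered sequence from a fair die."""
    return Fraction(1, sides) ** len(seq)

p_a = sequence_probability(all_ones)  # P(A)
p_b = sequence_probability(mixed)     # P(B)

# A and B are disjoint, so P(C) = P(A) + P(B) and P(A|C) = P(A) / P(C).
p_a_given_c = p_a / (p_a + p_b)
p_b_given_c = p_b / (p_a + p_b)

print(p_a == p_b)                # True
print(p_a_given_c, p_b_given_c)  # 1/2 1/2
```

Using Fraction rather than floats avoids any rounding question: both conditional probabilities come out as exactly 1/2.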


#4
Aug1011, 11:54 PM

P: 4,572




#5
Aug1111, 12:27 AM

P: 1,402

I'm curious too. Could you explain what entropy means in this context, using this example? Wikipedia defines information entropy as a property of a random variable. Which random variable are you talking about the entropy of? What is its domain, what is its range, and what is the rule that specifies its value for each input?
One obvious random variable we have here is the identity function on the set {n | n is an integer, 0 < n < 7}^{20}, whose distribution gives the (equal) probability of the two events {(6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)} and {(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)}. But presumably it's not this random variable whose entropy you mean. On the other hand, Wolfram Alpha and Mathematica treat entropy as a property of a list. I'm not sure what their definition is. Are these two kinds of entropy related?


#6
Aug1111, 01:19 AM

P: 4,572

Basically, higher entropy means more uncertainty, which means more "randomness". How to calculate it is well documented in the literature. For applications in physics, look at thermodynamics: the idea that things get more random as time passes, and the fact that entropy increases, is no accident. Entropy allows us to quantify this kind of behavior.
With regard to the dice, the outcome of all 1's has an entropy value very close to zero: there is only one symbol and no real randomness in that information. The second sequence, on the other hand, has variation in symbol type, which means higher entropy. Of the two, the second sequence has an entropy a lot closer to the theoretical entropy than the first, so if we were to judge whether the second sequence is "more likely" to have come from a particular process than the first, entropy is a quantitative variable that can be used to make an inference.
But you also have to remember that even though there is a big difference in entropy, both results are possible under that process. Entropy does not dictate otherwise, but it does give a quantitative measure of the randomness of some process, and hence gives us an indicator of how "likely" some piece of information is to have come from a particular process.
If you want a definitive definition of entropy (the one I have been talking about), get Claude Shannon's papers or the information theory book by Thomas Cover (the latest edition, as far as I know, is the 2nd). Broadly, the definition is the sum (or integral) over the domain of -p log(p).
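To make the quantity concrete, here is a minimal Python sketch of that sum applied to the observed symbol frequencies of a sequence. It assumes base-2 logarithms and that "the entropy of a sequence" means the entropy of its empirical frequency distribution; the name empirical_entropy is just illustrative.

```python
from collections import Counter
from math import log2

def empirical_entropy(seq):
    """Shannon entropy, in bits, of the observed symbol frequencies in seq."""
    n = len(seq)
    counts = Counter(seq)
    # Sum of -p * log2(p) over the symbols that actually occur in seq.
    return sum(-(c / n) * log2(c / n) for c in counts.values())

all_ones = (1,) * 20
mixed = (6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)

print(empirical_entropy(all_ones))  # zero: only one symbol occurs
print(empirical_entropy(mixed))     # close to log2(6)
print(log2(6))                      # entropy of a fair six-sided die
```

Note that this function only looks at the frequency counts, so any reordering of a sequence gives the same value.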


#7
Aug1111, 02:06 AM

Emeritus
Sci Advisor
PF Gold
P: 16,099

Assuming your quote is fair (I can't see the article), Marilyn made an unfortunately common psychological mistake. The question she answered is very different from the question that was asked.



#8
Aug1111, 02:21 AM

P: 4,572

The entropy argument also fits in with your second scenario and not just the first. You base your entropy calculations on the number and the frequency of symbols that occur in your information sample. Both of these are incorporated into an entropy calculation and not just one or the other. 


#9
Aug1111, 02:31 AM

Emeritus
Sci Advisor
PF Gold
P: 16,099

I can guess what your line of reasoning is, and it boils down to you answering the question Which is more likely? 


#10
Aug1111, 02:41 AM

Emeritus
Sci Advisor
PF Gold
P: 16,099

Another way to describe the mistake that might be useful:
The greater likelihood of getting 3 1's, 3 2's, 3 3's, 3 4's, 4 5's, and 4 6's versus 20 1's is exactly canceled out by the unlikelihood that they happen to be arranged in exactly the order 66234441536125563152.
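That cancellation can be verified exactly. A small Python sketch, assuming a fair six-sided die and independent rolls; the variable names are illustrative only.

```python
from fractions import Fraction
from math import factorial

histogram = (3, 3, 3, 3, 4, 4)  # counts of 1's, 2's, ..., 6's
n = sum(histogram)              # 20 rolls

# Multinomial coefficient: number of distinct orderings with this histogram.
arrangements = factorial(n)
for count in histogram:
    arrangements //= factorial(count)

p_single = Fraction(1, 6) ** n            # any one specific ordered sequence
p_histogram = arrangements * p_single     # P(histogram is (3,3,3,3,4,4))
p_order_given_histogram = Fraction(1, arrangements)  # P(that exact order | histogram)

# The histogram (20,0,0,0,0,0) has exactly one ordering, so
# P(all 1's) equals the probability of a single specific sequence.
p_all_ones = p_single

print(arrangements)
print(p_histogram * p_order_given_histogram == p_all_ones)  # True
```

The mixed histogram is more likely than the all-1's histogram by a factor of `arrangements`, and picking out one particular ordering divides by exactly that factor again.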


#11
Aug1111, 02:56 AM

P: 4,572

Entropy, as stated before, gives us a quantitative measure of information content and uncertainty. The most uncertain source of information is a random variable whose distribution is uniform, and where new trials in the stochastic process are statistically independent of past trials, in that [itex]P(X_n = x \mid X_{n-1} = a_{n-1}, X_{n-2} = a_{n-2}, \ldots, X_0 = a_0) = P(X_n = x)[/itex], which is the strongest form of independence there is.
Now, entropy requires us to supply a distribution, and based on that we get our entropy in some base. For a discrete uniform random variable, the entropy is always log2(n), where n is the number of possible states. If we have a process with a large number of trials, strong-law arguments should allow us to prove that the entropy of the information source tends to the theoretical entropy of our assumed distribution. This sort of reasoning is used in the classical frequentist approach, especially with sampling distributions, but we typically derive properties for things like the distribution of a statistic and not something like entropy.
I don't have a mathematical formula for the confidence interval for inference about the entropy of a sample with respect to the entropy of our underlying distribution (though I'm thinking about doing the calculation myself a little later on). My argument is that if the number of trials (in this case dice rolls) is large enough, then the entropy of that sample (with respect to the distribution) should approach the theoretical entropy of the assumed distribution.
Also, with regard to order, I haven't said that order affects anything. That depends entirely on the distribution. You could write an extremely complex distribution up to trial n, where the distribution is a really complicated mathematical expression that is conditional on every prior trial, or you could use the independence assumption (as with the binomial and multinomial).
I am simply basing my argument on the idea of creating a distribution with confidence intervals, on the assumption that as more information is created, the entropy of the entire information sample approaches the theoretical value, much as the variance of the mean statistic of a distribution approaches zero as n goes to infinity. I haven't provided any proofs of anything, and I am aware of that, but if I am wrong it would be nice to hear some justification, either a proof or simply a hole in my reasoning.
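The convergence claim can at least be illustrated numerically, though an illustration is not a proof. The sketch below assumes a fair six-sided die simulated with Python's standard random module and a fixed seed; empirical_entropy and roll_fair_die are illustrative names, not anything from the thread.

```python
import random
from collections import Counter
from math import log2

def empirical_entropy(seq):
    """Shannon entropy, in bits, of the observed symbol frequencies in seq."""
    n = len(seq)
    return sum(-(c / n) * log2(c / n) for c in Counter(seq).values())

def roll_fair_die(n_rolls, rng):
    """Simulate n_rolls independent rolls of a fair six-sided die."""
    return [rng.randint(1, 6) for _ in range(n_rolls)]

rng = random.Random(0)  # fixed seed so the run is repeatable
theoretical = log2(6)   # entropy of the uniform distribution on 6 states

for n_rolls in (20, 200, 2000, 20000):
    h = empirical_entropy(roll_fair_die(n_rolls, rng))
    # The gap to the theoretical value should shrink as n_rolls grows.
    print(n_rolls, round(h, 4), round(theoretical - h, 4))
```

This shows the sample entropy settling near log2(6) for large samples. It says nothing about which of two specific, equally probable 20-roll sequences was rolled.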


#12
Aug1111, 03:06 AM

P: 4,572

If you're using your standard purely random assumptions (in this case discrete uniform, where no trial is affected by any other), then ordering won't have an effect on the entropy of your theoretical distribution. It's not that the scenario with all 1's is impossible, it's just that it becomes more unlikely as that keeps happening. Picture a confidence interval that measures the sample entropy against the theoretical entropy of the assumed distribution. As the information content gets more trials (and more symbols), the entropy is expected to tend towards the theoretical value, and the spread of the sampling distribution of the entropy also shrinks (i.e. the variance decreases).


#13
Aug1111, 03:19 AM

Emeritus
Sci Advisor
PF Gold
P: 16,099

In particular, P(11111111111111111111) = P(66234441536125563152). I'm having trouble making sense of your post. At best, it seems to be a very long-winded way of saying "Given a lot of independent trials, the observed frequencies tend towards the probability distribution the samples are drawn from". But, as I've said before, the question is not "Which histogram is more likely: (20,0,0,0,0,0) or (3,3,3,3,4,4)?"


#14
Aug1111, 03:22 AM

P: 737

This reminds me of how, in biology, my teacher emphasized that, assuming equal probability of a boy or a girl, even if you have 10 boys, the 11th child is equally likely to be either.
What makes both the biology example and the die example true is that, if preceding events don't affect the probability of subsequent events, then the preceding outcomes are just as probable to happen again as they were. Concretely, the fact that what you roll has no effect on subsequent rolls means that rolling a 1 doesn't make rolling a 1 on your next roll any more or less probable.


#15
Aug1111, 03:33 AM

P: 4,572

The reason why the first is zero is that P(X = 1) = 1 and log(1) = 0. Maybe I'm arguing something that I didn't intend to argue, so I'll be clear. What I'm arguing is that as you get more trials from your process, the entropy of your information sample approaches that of the underlying distribution. Probabilistically, getting 111111111 and 635162122 are equally likely, and I'm not disagreeing with that. What I'm saying is that I have more "confidence" in the second one coming from our stochastic process than the first. If I made an error in communication, then I hope it is cleared up now. The probability is the same, but the level of confidence is not, and that is why Vos Savant said that she preferred the second output to the first.
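For reference, here is a worked version of the entropy values being compared. It assumes entropy is computed in base 2 from the observed frequencies of the two 20-roll sequences from earlier in the thread (histograms (20,0,0,0,0,0) and (3,3,3,3,4,4)), with the usual convention that 0 log 0 = 0; the symbol \hat{p}_k for the observed frequency of face k is introduced here only for illustration.

```latex
\begin{align*}
H(\hat{p}) &= -\sum_{k=1}^{6} \hat{p}_k \log_2 \hat{p}_k \\
H_{\text{all 1's}} &= -1 \cdot \log_2 1 = 0 \\
H_{\text{mixed}} &= -4 \cdot \tfrac{3}{20} \log_2 \tfrac{3}{20}
                   \; - \; 2 \cdot \tfrac{4}{20} \log_2 \tfrac{4}{20}
                   \approx 2.571 \text{ bits} \\
H_{\text{fair die}} &= \log_2 6 \approx 2.585 \text{ bits}
\end{align*}
```

Both sequences nevertheless have the same probability, (1/6)^20, under the fair-die assumption.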


#16
Aug1111, 03:47 AM

Emeritus
Sci Advisor
PF Gold
P: 16,099

Here is one way to make the mistake I mentioned earlier: I'm going to invent a statistical test. My statistic T is the entropy of the distribution of observed frequencies. Computing T for 1111... gives a result less likely than computing T for 6623.... Therefore, I infer that 6623... is what was rolled.


#17
Aug1111, 04:40 AM

P: 42

Indeed, the entropy of the empirical distribution tends to the entropy of the underlying distribution as N (the number of throws) tends to infinity. But in our case N = 20 is fixed. Having a concrete sample (a sequence of 20 throws) does indeed allow you to change your confidence in some assumption, e.g. the hypothesis that the die is fair, but this is a whole different problem.


#18
Aug1111, 06:27 AM

P: 4,572

As a concrete example, let's focus on a statistical parameter: the mean of a distribution. We have sample information, which is obviously finite, and we use the unbiased estimator to get a good estimate of the mean. We assume that each of the samples comes from the same underlying distribution, let's say normal. We then use statistical theory to calculate the distribution of the normalized difference between the statistic and the parameter, [x_bar - mu]/root(Var), and this is normal because the statistic is a linear combination of normally distributed variables. Based on this we construct a confidence interval under our assumptions: a 95% interval says that, 95% of the time, the true difference between the statistic and the parameter would lie in this interval. As we increase the sample size we expect the variance of the sample mean to go to zero, which is intuitive since our estimate of the true mean (the parameter we are estimating) gets more accurate. This is basically a result of the strong law of large numbers, and this kind of thinking underlies much of classical statistical theory.
With the above in mind, I am proposing the same thing, but with entropy as the parameter instead of the mean. As in the mean example, you create a distribution that lets you do an inference calculation based on the entropy statistic belonging to a distribution with a theoretical entropy (the parameter in this case). So, by analogy with the mean, you get a sampling distribution that, instead of measuring the mean, is focused on testing whether the sample's entropy (the statistic) corresponds to the theoretical entropy (the parameter). You construct your sampling distribution, and from it generate a confidence interval for some level of confidence.
You base this on (1) the theoretical entropy (from your assumed distribution), (2) your statistic entropy (measured from your sample), and (3) the number of samples. You may actually need more than that depending on the distribution itself, but let's stick to the basics. Again, as with the mean, you construct an interval based on your level of confidence, and then depending on the result accept or reject the null hypothesis. For the discrete uniform distribution the entropy is always maximized, so as n increases you would expect the sampling distribution to tighten around the maximum, much as the distribution of the mean tightens around a specific value.
So, in finishing, confidence is basically using probability models to make an inference about how "confident" you are that a sample meets the criteria of some theoretical parameter, which in this case is entropy.
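One hedged way to sketch such a procedure is a Monte Carlo approximation of the sampling distribution of the entropy statistic for 20 rolls of a fair die. Everything here is an assumption made for illustration: the one-sided 5% cutoff, the simulation size, the seed, and all function names. As noted earlier in the thread, a test like this addresses whether the die looks fair, not which of two equally probable sequences was rolled.

```python
import random
from collections import Counter
from math import log2

def empirical_entropy(seq):
    """Shannon entropy, in bits, of the observed symbol frequencies in seq."""
    n = len(seq)
    return sum(-(c / n) * log2(c / n) for c in Counter(seq).values())

def simulate_null_entropies(n_rolls=20, n_sims=20000, seed=0):
    """Sorted sample entropies of simulated fair-die sequences (the null model)."""
    rng = random.Random(seed)
    return sorted(
        empirical_entropy([rng.randint(1, 6) for _ in range(n_rolls)])
        for _ in range(n_sims)
    )

null = simulate_null_entropies()
# One-sided test: unusually LOW entropy counts as evidence against fairness.
cutoff = null[int(0.05 * len(null))]  # approximate 5th percentile under the null

all_ones = (1,) * 20
mixed = (6, 6, 2, 3, 4, 4, 4, 1, 5, 3, 6, 1, 2, 5, 5, 6, 3, 1, 5, 2)

for name, seq in (("all ones", all_ones), ("mixed", mixed)):
    h = empirical_entropy(seq)
    verdict = "reject fairness" if h < cutoff else "do not reject fairness"
    print(name, round(h, 3), verdict)
```

Simulation is used here because no closed-form sampling distribution for the entropy statistic is given in the thread; an exact enumeration over histograms would be an alternative.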

