Is Marilyn Vos Savant wrong on this probability question?

  • Thread starter CantorSet
  • Start date
  • Tags
    Probability
  • #106
If you think 1,1,1,1,1,1,1 has essentially no chance of occurring as the winning numbers in a lottery, then you have just answered why the lottery is not a good bet. I.e. every other choice is just as unlikely as this one in a fair lottery.

It is ironic that Ms. Vos Savant would make this simple mistake since she rode to fame on a probability question that stumped some mathematicians (including me) as follows:

Suppose there are three doors and a prize lies behind one of them, and you have one choice. After you indicate your preferred choice the moderator opens another door with nothing behind it, leaving two doors still closed, yours and one other. Then you have the opportunity of keeping to your original choice or changing it.

What should you do, and why?
 
Last edited:
  • #107
mathwonk said:
If you think 1,1,1,1,1,1,1 has essentially no chance of occurring as the winning numbers in a lottery, then you have just answered why the lottery is not a good bet. I.e. every other choice is just as unlikely as this one in a fair lottery.

It is ironic that Ms. Vos Savant would make this simple mistake since she rode to fame on a probability question that stumped some mathematicians (including me) as follows:

Suppose there are three doors and a prize lies behind one of them, and you have one choice. After you indicate your preferred choice the moderator opens another door with nothing behind it, leaving two doors still closed, yours and one other. Then you have the opportunity of keeping to your original choice or changing it.

What should you do, and why?

I've said this before, but I think it's important to bring this up.

The difference, IMO, is that Ms. Vos Savant is comparing an underlying process with the estimation of that process's parameters using likelihood techniques on existing data.

Hurkyl is right in saying that if the underlying process is random, then every combination will be as unlikely (or likely) as every other possibility. No argument there.

But an important thing that statisticians have to do is 'guess' the probabilistic properties of a stochastic process from data. For a binomial process we use maximum likelihood estimation, which gives the estimate t/n (plus or minus a standard error), where t is the number of successes ('heads') and n is the number of trials.
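A minimal sketch of that estimator (the counts below are made up purely for illustration):

Code:
import math

def binomial_mle(t, n):
    """Return (p_hat, standard_error) for t successes in n Bernoulli trials."""
    p_hat = t / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, std_err

p_hat, se = binomial_mle(t=7, n=20)                # e.g. 7 heads in 20 flips
print("estimate = %.3f +/- %.3f" % (p_hat, se))    # estimate = 0.350 +/- 0.107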

My guess is that Marilyn is talking about likelihood estimation in the very last statement, as opposed to the true underlying probabilistic properties that Hurkyl is referring to.

Again, if the die really does come from a purely random process then Hurkyl is right. But suppose we have to measure some kind of 'confidence' from existing data, where we do not know the real underlying process and have to make a judgement about its probabilistic properties. If a likelihood procedure is done on a space with 6 possibilities per trial over 20 trials and we get all 1's, then given this data we have to say that we are not 'confident' that it came from a process that is purely random.

It's important to see the distinction: the likelihood results do not say that the data doesn't come from a particular process; rather, they give evidence for or against its having come from a particular kind of process.

Statisticians have to do this kind of thing all the time: they get data and they have to try and extract the important properties of the underlying process itself. We don't often get the luxury of knowing the process in any great detail so what we do is we say 'this model looks good, let's try and estimate its parameters using the data'.

People have to remember that stating the probabilistic properties of a known underlying stochastic process and trying to estimate distribution parameters for a process that is not known are two very different things.

One specifies properties for a process that is known; the other tries to 'figure out', using sound statistical theory, what the specifics of the process should be given the data, since we don't actually know the underlying process.

Again, two very different things.
 
  • #108
Both sequences are equally likely.

On a side note, if I roll a fair die 999999999999 times and get 1 each time, and I roll it again, the probability of rolling a 1 is still 1/6. (Empirically, we might dispute that the die was fair, however! ;))

Here is a nice quote from Feynman:

"You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won't believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!"
 
  • #109
Everyone agrees the die would roll both sequences with equal probability. That's not the question being addressed in the second example.

The question being addressed in the second example is 'presented with two numbers, one of which was generated by rolling a die, one of which was generated by a different, unknown process, which was more likely to be generated by the die?'

In this case, I think many approaches will suggest the string of 1s is less likely to be the die-generated one, but with only one data point and no information about the process generating the non-die number, the predicted probabilities will always be close to 1/2 for each.
 
Last edited:
  • #110
chiro said:
But an important thing that statisticians have to do is 'guess' the probabilistic properties of a stochastic process using data. For a process that is binomial we use things like MLE estimation and using this we get the estimator to be t/n +- std where t is the number of 'true' or 'heads' and n is the number of trials.

My guess is that Marilyn is talking about likelihood estimation in the very last statement as opposed to true underlying probabilistic properties that Hurkyl is referring to.

The central limit theorem (CLT) is a great technique for predicting the mean of a large sample. The fallacious gambler uses it to predict the next outcome after a losing streak. That the CLT is a good technique for one purpose doesn't mean it's a good idea for the gambler to use it in a vaguely related situation.


When presented with the knowledge
Exactly one of
  1. 11111111111111111111
  2. 66234441536125563152
is real, and the other is fake
(and given the assumption that the specific question being asked is independent of your strategy for responding to it) there is exactly one reason why you should predict that option (2) is the real one: you believe
(*) whatever process led to you being faced with this question would produce this pair with all-1's being fake more often than with all-1's being real.
(also assuming your goal is to be right as often as possible)

Any approach you have to the question that, in the end, isn't aimed specifically at deciding whether (*) is true or not is fundamentally misguided.

aside: if you believe the generation of the fake is independent of the generation of the real, then (*) simplifies to
(*) the process that generates the fake is more likely to produce all 1's than it is to produce 66234441536125563152​
 
  • #111
Hurkyl said:
The central limit theorem (CLT) is a great technique for predicting the mean of a large sample. The fallacious gambler uses it to predict the next outcome after a losing streak. That the CLT is a good technique for one purpose doesn't mean it's a good idea for the gambler to use it in a vaguely related situation.


When presented with the knowledge
Exactly one of
  1. 11111111111111111111
  2. 66234441536125563152
is real, and the other is fake
(and given the assumption that the specific question being asked is independent of your strategy for responding to it) there is exactly one reason why you should predict that option (2) is the real one: you believe
(*) whatever process led to you being faced with this question would produce this pair with all-1's being fake more often than with all-1's being real.
(also assuming your goal is to be right as often as possible)

Any approach you have to the question that, in the end, isn't aimed specifically at deciding whether (*) is true or not is fundamentally misguided.

aside: if you believe the generation of the fake is independent of the generation of the real, then (*) simplifies to
(*) the process that generates the fake is more likely to produce all 1's than it is to produce 66234441536125563152​

Hurkyl, do you know what likelihood techniques and parameter estimation are all about?

Like I said above, they focus on completely different things. The likelihood procedures are used to gauge what the parameters are for an assumed model given the data: you don't do it the other way around.

Likelihood procedures aren't perfect of course, but the point of them, including parameter estimation, is an intuitive concept that anyone can appreciate, not just a statistician.

As you say, if the die-rolling process is truly, purely random then there is no reason why every outcome is not equally likely, but I'm afraid there is a huge caveat: we statisticians and scientists can't simply assume this.

We have to use statistics and its methods to see how our hypotheses are backed up by the evidence, which translates into analyzing the actual data. We have to check that the evidence supports the notion that the die or the coin or whatever is what it is: we can't just say 'it's going to be equally likely'; we have to do the experiment, get the right data and process it to see whether the data backs up our intuition.

You don't need to bring in the Central Limit Theorem or anything else: the idea is very basic and can be understood by anyone, statistician or non-statistician in a very simple way.
 
  • #112
ParticleGrl said:
Everyone agrees the die would roll both sequences with equal probability. That's not the question being addressed in the second example.

The question being addressed in the second example is 'presented with two numbers, one of which was generated by rolling a die, one of which was generated by a different, unknown process, which was more likely to be generated by the die?'

In this case, I think many approaches will suggest the string of 1s is less likely to be the die-generated one, but with only one data point and no information about the process generating the non-die number, the predicted probabilities will always be close to 1/2 for each.

I don't understand how that question is the question that arose in the second part.

Here is what is being said:
"But let’s say you tossed a die out of my view and then said that the results were one of the above. Which series is more likely to be the one you threw? Because the roll has already occurred, the answer is (b). It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s."

This does not relate to the first statement. The roll sequence is more likely to produce a string of mixed numbers. However, what we have here is a choice between two specific strings of numbers. Her conclusion, "thus, the answer is (b)" is false. Everything else that she said is technically fine, but largely irrelevant. The probability that the sequence is a mixed sequence of numbers is not the same thing as the probability that the sequence is a PARTICULAR mixed sequence of numbers.
 
  • #113
"But let’s say you tossed a die out of my view and then said that the results were one of the above. Which series is more likely to be the one you threw? Because the roll has already occurred, the answer is (b). It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s."

This does not relate to the first statement. The roll sequence is more likely to produce a string of mixed numbers. However, what we have here is a choice between two specific strings of numbers. Her conclusion, "thus, the answer is (b)" is false. Everything else that she said is technically fine, but largely irrelevant. The probability that the sequence is a mixed sequence of numbers is not the same thing as the probability that the sequence is a PARTICULAR mixed sequence of numbers.

Right, so whoever rolled the die is presenting you with two choices a and b. One of them was generated by the die roll, one was generated by an unknown process (you don't know where the alternative number came from).

So the question boils down to: given the strings of numbers 66234441536125563152 and 11111111111111111111, which one was more likely to have been generated by a die roll?

Which is related to a similar question: how many times do you have to roll 1 in a row before you start to wonder if your die is biased?
 
Last edited:
  • #114
I think those two strings are equally likely to be produced by a sequence of dice rolls.

However, we could use a chi-square test to show that obtaining a string of 1's strongly suggests that the die was loaded.
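A rough sketch of such a test, assuming twenty rolls that all came up 1 and comparing against the uniform expectation:

Code:
from scipy.stats import chisquare

observed = [20, 0, 0, 0, 0, 0]          # face counts: twenty 1's, no other faces
result = chisquare(observed)            # expected frequencies default to uniform
print(result.statistic, result.pvalue)  # statistic = 100.0, p-value on the order of 1e-20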
 
  • #115
I think those two strings are equally likely to be produced by a sequence of dice rolls.

Yes, literally every single person in this thread agrees with you. That's not what the second question is asking. It's asking:

GIVEN that one of these strings was produced by the dice and one was not, which was more likely produced by the dice?

However, we could use a chi-square test to show that obtaining a string of 1's strongly suggests that the die was loaded.

Which is what I'm getting at: the string of 1s is less likely to be the die-generated string.
 
  • #116
micromass said:
Firstly, my code written in Scheme:

Code:
(define (MakeRandomList)
  {local [(define (MakeRandomList-iter n)
            {local [(define x (+ (random 2) 1))]
              (if (= n 0)
                  (list)
                  (cons x (MakeRandomList-iter (- n 1))))})] 
    (MakeRandomList-iter 10)})

(define (ListEqual List1 List2)
  {local [(define (ListEqual-iter l1 l2)
            (if (empty? l1)
                true
                (and (= (car l1) (car l2)) (ListEqual-iter (cdr l1) (cdr l2)))))]
    (ListEqual-iter List1 List2)})

(define list1 (list 1 1 1 1 1 1 1 1 1 1))
(define list2 (list 1 2 1 2 1 1 1 2 1 2))


(define (Test n)
  {local [(define (Test-iter n amount1 amount2)
            {local [(define CurrentList (MakeRandomList))]
              (if (> n 0)
                  (if (ListEqual CurrentList list1)
                      (Test-iter (- n 1) (+ amount1 1) amount2)
                      (if (ListEqual CurrentList list2)
                          (Test-iter (- n 1) amount1 (+ amount2 1))
                          (Test-iter (- n 1) amount1 amount2)))
                  (list amount1 amount2))})]
    (Test-iter n 0 0)})

(Test 1000000)

A disclaimer first: the original post worked with "rolling the dice 20 times". This is infeasible (the chance of a simulated run hitting either specific 20-roll sequence is only about 2·(1/6)^20). Therefore, I changed the problem to "flipping a coin 10 times".

I worked with the two sequences 1111111111 and the supposedly random sequence 1212111212.

Now, what I did was:
Each test, I flip a coin 10 times. If the result is not one of the two sequences above, I discard the test. If the result is one of the two sequences above, I add 1 to the amount of times I saw the sequence.
This I do a million times.

Why is this a good representation of the test?
The original test was that I flip a coin 10 times. Then I get a choice as to which one of the above sequences was flipped. Of course, to be faced with that very choice, I actually need to get one of the sequences. This is why I discard every experiment where I do NOT get one of the sequences.

After I get one of the sequences, I record which of the two it was. Adding 1 to the count for sequence 1 corresponds to getting it right if you guessed 1; adding 1 to the count for sequence 2 corresponds to getting it right if you guessed 2.
Eventually, the two counts correspond to the number of times each guess would have been right.

So, after iterating it a million times, I get
Sequence 1: 948
Sequence 2: 995

A subsequent test yielded:
Sequence 1: 1015
Sequence 2: 1001

These two counts are so close together that it seems plausible that the proportion of times you get it right is indeed 50-50. Running it more than 1000000 times would only reinforce this, but I don't have the time to do so.

If that's the way you choose to decide the issue, maybe you can run a significance test on each of the differences 995-948 and 1015-1001. I think it will pass, i.e., be accepted, at just about any significance level.
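For instance, a minimal sketch of that test, treating the number of times sequence 2 turned up among all hits as Binomial(n, 0.5):

Code:
from scipy.stats import binomtest

for n1, n2 in [(948, 995), (1015, 1001)]:
    result = binomtest(n2, n=n1 + n2, p=0.5)   # two-sided by default
    print(n1, n2, result.pvalue)
# Both p-values come out well above 0.05, so the 50-50 hypothesis is not rejected.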
 
  • #117
mathwonk said:
If you think 1,1,1,1,1,1,1 has essentially no chance of occurring as the winning numbers in a lottery, then you have just answered why the lottery is not a good bet. I.e. every other choice is just as unlikely as this one in a fair lottery.

It is ironic that Ms. Vos Savant would make this simple mistake since she rode to fame on a probability question that stumped some mathematicians (including me) as follows:

Suppose there are three doors and a prize lies behind one of them, and you have one choice. After you indicate your preferred choice the moderator opens another door with nothing behind it, leaving two doors still closed, yours and one other. Then you have the opportunity of keeping to your original choice or changing it.

What should you do, and why?

Mathwonk: a good point can be made that the reason the problem stumped a large number of people is that it was not well-posed--just like this last one, where I think Vos Savant could have made more of an effort to avoid potential ambiguities in her description, i.e., to specify the layout in such a way that alternative understandings of it are less likely.
 
Last edited:
  • #118
Bacle2 said:
Mathwonk: a good point can be made that the reason the problem stumped a large number of people is that it was not well-posed--just like this last one, where I think Vos Savant could have made more of an effort to avoid potential ambiguities in her description, i.e., to specify the layout in such a way that alternative understandings of it are less likely.

I agree with you; Marilyn's responses appear to be characteristically confusing; I find myself wondering if she is purposely trying to trip up certain intelligent people...

Of course, that wonder is just an automatic reaction of mine.. and not a considered opinion. Upon thinking about her response a bit more, I notice that Marilyn evokes thoughts (in my eyes) of normal but confusing "women" conversations.

I don't think it uncommon for men, like myself, to infer different priorities of meaning than the women actually involved in such men-exclusive conversations. (I note Loren hasn't yet responded again, and I am only noticing one other female respondent entering the melee... GOOD for her! )

I do agree that Marilyn has a command of the English language which makes her somewhat liable to judgment; eg: the IQ tests she took were heavily biased by men writers at the time...

However, I know that judging her wrong based on a manly interpretation (solely) is likely an injustice (which is why I don't personally care to do it ? ).

Marilyn might be careless, tired, annoyed with a leading question, or something along those lines; however, if even the original auditor (?Loren?) really did not understand Marilyn's nuances -- then Marilyn has made a true "faux pas" where she ought to know better *intuitively*.
 
  • #119
micromass said:
Can you post an outline of your program in pseudocode, please??

Python is often equated with pseudo-code :confused:.

The program itself is fairly lengthy because I tried to include several different interpretations of the question. (Liars included, I already have someone claiming to have gotten odds on the 20x dice throw by outguessing the python random generator intuitively! It's possible, but how do I check if he told me the truth?)

Which sub-section do you want me to outline? I can edit the "if" statements out such that only your question is listed with comments / pseudo-code. I tried *VERY* hard to comment the program thoroughly (It is well over 50% comments...) Eg: Here's the tiny 3 shell game (edited only to improve my spelling and remove a game irrelevant print statement)

Code:
def ShellGame(_):
	"""
	Play a "Three then eject one Spam can shell game!"
	Load one of three cans with spam on a table, then let user pick one.
	Then, Pythonically eject a different but empty shell off the table
	when the customer finishes choosing. (please miss the customer!)
	Apologize profusely, and do allow (subject to Marilyn shell game 
	rules), the upset customer to re-choose among the two remaining shells.

	Marilyn maintains that the probability of getting a prize from the
	two remaining shells is not 50%/50% *depending* on your *aposteriori*
	choosing choice method.  That is the point of the test.

	This demonstrates how I learned about aposteriori probability changes 
	over 18 years ago, when I lost a serious bet to a mate of mine.
	Learn from my mistakes (!) as the voice of casino experience says *ouch*

	:)
	"""
	onTable={ 1:"Nothing", 2:"Nothing", 3:"Nothing" } # Set shells on table

	global usedDice, usedCoins # Casinos Keep track of used dice and coins.
	whichShell, usedDice = DiceToShell( dicePool[usedDice] ), usedDice+1
	onTable[whichShell]="SPAM AND SpAm sPaM SpAm" # Fill one shell randomly 

	def GetOnTable(): # Pick a shell subroutine
		menu=[]
		print "Shell game Table menu: appetizer is in ONE of:"
		for i,j in onTable.items():
			print "shell and tin can *",i,"*, ",
			menu.append(i)
		print "\n"
		choice,_=GetChoice(menu)
		return choice

	choice=GetOnTable() # first, Let the customer calmly choose a shell

	# Flip a coin in preparation for violent ejaculation
	headsOrTails,usedCoins = coinPool[usedCoins], usedCoins+1

	# Do a quick posteriori analysis for a dangerous random eject
	if onTable[choice]=="Nothing": headsOrTails="Heads"# remove possibility 

	for i,j in onTable.items():
		if i==choice: continue # Don't ever remove the user's choice!
		if j=="Nothing":  # This is empty, and thus not illegal to bomb 
			if headsOrTails == "Heads": # Randomly set off a shell 
				del onTable[i] # Shell is NOW GONE off table.
				print "\nI'm SO Sorry!"
				print "Can.an.d shell #",i," Just BLEW off!"
				print "Thankfully it held no prize!"
				print "Whew, now, you may re-choose for prize"
				print
				break
			else:
				headsOrTails=ReverseCoin( headsOrTails )

	# There are now TWO shells left, and the prize is still available 
	# Notice there was NO SWITCHEROO.  Just a safe pointless ejaculation.
	# Now the customer still doesn't KNOW where the item is, they guess ??
	
	choice=GetOnTable()
	print "You found ",onTable[choice]," for a prize"
	return ( onTable[choice] != "Nothing" )	# Win Spam=True, Nothing=False
# END of shell game.

How can I improve the code to make it more readable for you?

I don't know Scheme off the top of my head, but I can attempt to translate (crudely) the part of the Casino you are interested in; just let me know which part of (MarilynCasinoPack.py), and give me some time to read the specifications of Scheme.
 
  • #120
andrewr said:
(I note Loren hasn't yet responded again, and I am only noticing one other female respondent entering the melee... GOOD for her! )

And good for him -- Loren. One less woman.

I must have some kind of dyslexia in trying to respond to posts.

I don't always agree with Marilyn, and this puzzle's answer I also find non-intuitive -- but similar to the Monty Hall paradox.
 
  • #121
andrewr said:
I agree with you; Marilyn's responses appear to be characteristically confusing; I find myself wondering if she is purposely trying to trip up certain intelligent people...

Of course, that wonder is just an automatic reaction of mine.. and not a considered opinion. Upon thinking about her response a bit more, I notice that Marilyn evokes thoughts (in my eyes) of normal but confusing "women" conversations.

I don't think it uncommon for men, like myself, to infer different priorities of meaning than the women actually involved in such men-exclusive conversations. (I note Loren hasn't yet responded again, and I am only noticing one other female respondent entering the melee... GOOD for her! )

I do agree that Marilyn has a command of the English language which makes her somewhat liable to judgment; eg: the IQ tests she took were heavily biased by men writers at the time...

However, I know that judging her wrong based on a manly interpretation (solely) is likely an injustice (which is why I don't personally care to do it ? ).

Marilyn might be careless, tired, annoyed with a leading question, or something along those lines; however, if even the original auditor (?Loren?) really did not understand Marilyn's nuances -- then Marilyn has made a true "faux pas" where she ought to know better *intuitively*.

Well, the cynic in me believes that controversy, however artificial, is good publicity for her site, and for her, but I don't have any real/hard evidence to support the belief that she's purposefully being ambiguous.
 
  • #122
I apologize for being one of the people who, by their error in solving her probability problem, helped give prominence to Ms. Vos Savant.

Unfortunately she has parlayed this incident into a notoriety that is mostly undeserved, at least in regard to mathematics, of which she is largely ignorant.

Being smart, even really really smart, does not translate into understanding an old and complicated subject.

Here is a review by a friend of mine of one of her almost worthless books on a mathematical topic of some interest.

http://www.dms.umontreal.ca/~andrew/PDF/VS.pdf
 
  • #123
No problem; I have fallen through plenty of mathematical potholes myself.
 
  • #124
chiro said:
Hurkyl, do you know what likelihood techniques and parameter estimation are all about?
Yes, actually. They don't apply to the question we're considering.

If we had a model of how the person was choosing the fake results, we could take this sample (and ideally many more) and work out a posterior distribution on the parameters of the model we don't know.

But that's not what we're doing. We're faced with two alternatives A and B, and we need to decide whether P(A is real) > P(B is real), conditioned on the fact that we are faced with {A,B}. I.e., we need to determine which of:
  • P(B would be generated as fake, given that A was rolled)
  • P(A would be generated as fake, given that B was rolled)
is larger. If we had a model of how the fake is generated, or some other way of estimating these probabilities, we could apply that to infer which is more likely to be real. If we don't have such a thing, then we have to come up with one.
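As a toy illustration only (the opponent model and the value of q below are invented purely for the sake of the example, not anything established in this thread), here is how such a model would decide the question:

Code:
SEQ_A = "11111111111111111111"   # option 1 in the question
SEQ_B = "66234441536125563152"   # option 2 in the question

q = 0.30                                  # assumed chance the opponent bluffs with all 1's
p_roll = (1 / 6) ** len(SEQ_A)            # chance a fair die produces any one particular 20-digit string

# P(real = B) is proportional to P(die rolled B) * P(opponent faked A)
# P(real = A) is proportional to P(die rolled A) * P(opponent faked B)
weight_B_real = p_roll * q                                  # faking A means playing the all-1's bluff
weight_A_real = p_roll * (1 - q) * (1 / 6) ** len(SEQ_B)    # faking B means uniformly hitting exactly SEQ_B

print(weight_B_real / (weight_B_real + weight_A_real))      # ~ 1.0: under this model, guess that B is real

Swap in a different opponent model and the verdict can change; that is exactly what (*) is about.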


You don't need to bring in the Central Limit Theorem or anything else:
My CLT example is one of someone using a statistical tool wrongly, and deriving nonsensical results.
 
  • #125
Can't believe I read the entire thread.
 
  • #126
Hurkyl said:
Yes, actually. They don't apply to the question we're considering.

This is where I disagree.

The point of the last example is that we don't know the process and therefore don't know the distribution. You can't just calculate probabilities for an unknown process.

When you have this example you need to use an estimator to estimate the parameters and to do this you need to use the data.

Again, you can't just calculate probabilities because you don't actually know them: you need to make an inference about what they could be, based on the data that has been sampled.

We assume that the process has six probabilities that add up to 1 and that we have a completely independent process, but beyond that we don't know anything: we can only infer what the actual characteristics of the process are by looking at the data and making some kind of inference: not the other way around.
 
  • #127
Loren Booda said:
And good for him -- Loren. One less woman.

I must have some kind of dyslexia in trying to respond to posts.

I don't always agree with Marilyn, and this puzzle's answer I also find non-intuitive -- but similar to the Monty Hall paradox.

:blushing:

All the Lorens I know are women, oh well! I wonder how many Lorens Marilyn knows...

Marilyn's commentary and the Monty Hall problem are, as far as I know, identical.
There was an extension, if I remember correctly, to 4-, 5-, and 6-shell games -- but that's really trivial in any event... It just shifts the probability down a notch for each shell.
 
  • #128
chiro said:
This is where I disagree.

The point of the last example is that we don't know the process and therefore don't know the distribution. You can't just calculate probabilities for an unknown process.
Right. If you don't have any priors, you can't do statistical inference. You can gather and analyze data, and tabulate whatever evidence you can extract from the data, but you cannot use that evidence to infer whether some hypothesis is more likely than some other hypothesis.

You have to have prior probabilities if you want to do statistical inference -- even if it's just a blind assumption of uniform priors of some sort.
When you have this example you need to use an estimator to estimate the parameters and to do this you need to use the data.
You said we don't know the process -- we don't have any parameters to estimate! :tongue:

If you have a prior assumption about the data generation -- e.g. that it's generated by some parametrized process and you have flat priors on the parameters -- then we could try to estimate parameters. We could then take the parameter with the highest posterior probability and see what distribution that produces on the thing we're actually interested in...

but then we would be doing things wrong. When you string together ideas in an ad-hoc fashion, rather than in a way aimed at solving the problem you're actually trying to solve, you get poor results.

If we remember what we're actually trying to solve, we would know to factor in information from all parameters, and could do so directly without having to deal with parameter estimation as an intermediary:
[tex] P(A \mid O) \propto \sum_\theta P(A \wedge O \mid \theta) P(\theta)[/tex]
Where A is the hidden value we're trying to predict, O is the observation we saw, and [itex]\theta[/itex] is the parameter. The most likely value of A is the one that maximizes the sum on the right hand side.

(the constant of proportionality is the same for all A)

Incidentally, in the special case that, for each [itex]\theta[/itex], [itex]A[/itex] and [itex]O[/itex] are independent, this simplifies to
[tex] P(A \mid O) = \sum_\theta P(A \mid \theta) P(\theta \mid O)[/tex]
(equality, this time) One could interpret this as saying, in this special case, that we can get the probability of A given our observation by first using O to get posterior probabilities for [itex]\theta[/itex], and then remembering to incorporate information from all [itex]\theta[/itex], weighted appropriately.
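A small numeric sketch of that last identity (the two-hypothesis parameter space, the assumed bias of the loaded die, and the flat prior are purely illustrative):

Code:
# theta ranges over two hypothetical dice; O is "twenty 1's were rolled"; A is the next roll.
dice = {
    "fair":   [1 / 6] * 6,
    "loaded": [0.75] + [0.05] * 5,   # assumed bias, purely for illustration
}
prior = {"fair": 0.5, "loaded": 0.5}

# posterior over theta given twenty observed 1's
likelihood = {name: probs[0] ** 20 for name, probs in dice.items()}
norm = sum(likelihood[name] * prior[name] for name in dice)
posterior = {name: likelihood[name] * prior[name] / norm for name in dice}

# P(next roll = face | O) = sum over theta of P(face | theta) P(theta | O), since rolls are independent given theta
p_next = [sum(posterior[name] * dice[name][face] for name in dice) for face in range(6)]
print(posterior)   # the loaded die dominates the posterior
print(p_next)      # roughly [0.75, 0.05, 0.05, 0.05, 0.05, 0.05]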
 
Last edited:
  • #129
Hurkyl said:
You said we don't know the process -- we don't have any parameters to estimate! :tongue:

We know that there are six probabilities and that each trial is to be assumed independent of the others. We have a model, but we don't have the distribution: there is a difference.

It's not a fair characterization to say what you said: we know the probability model for a coin flip and we use the data to get a good statistical estimate for P(Heads) and P(Tails) by using an appropriate procedure.

You can't just say things like that.

The thing is that typically we assume independence for each trial, which ends up simplifying the general case very nicely. By assuming each trial is completely independent we don't have to use the complex general procedures that we would otherwise have to use. The assumption P(A and B) = P(A)P(B) and P(A|B) = P(A) for all appropriate events A and B makes it a lot easier.

We know what the model is, we just don't know its parameters and the point of the exercise is to estimate them.

Saying that we don't have any parameters to estimate is just really ignorant.
 
  • #130
chiro said:
You can't just say things like that.
Then why did you? You can't complain about a proper analysis of the problem because "we don't know the process" and then turn right around and justify your sloppy approach by making very strong assertions about the process. :grumpy:
 
  • #131
And your analysis doesn't even look like the problem we were considering anyway. Did you start considering a very different problem?

For reference, the problem was essentially:

We are given two 20-long sequences of numbers. One of them is "real", generated by rolling a fair die. One of them is "fake", selected by our opponent. Our goal is to guess which sequence is real.​
 
  • #132
Hurkyl said:
Then why did you? You can't complain about a proper analysis of the problem because "we don't know the process" and then turn right around and justify your sloppy approach by making very strong assertions about the process. :grumpy:

Well if you want the absolute explicit description, then we know the model (or we assume one) but we don't know the parameters. Is that ok?

Our model is that every roll has 6 possibilities. Furthermore we assume that every roll is independent. This is a multinomial distribution with 6 choices per trial.

This is the model we assume whether the die is balanced (all probabilities per trial are equal) or not (they are not all equal).

Now a balanced die is assumed to have all probabilities equal per trial (1/6). An unbalanced one is not.

Marilyn said in her statement that if someone rolled all 1's out of her view and then told her the result, she would not believe it came from a fair die.

Here 'fair' translates to: all probabilities per trial (or throw) are the same, 1/6.

Now if we talk about a die, whatever the probabilities are, if we were going to try and estimate the parameters of the die, we would for all practical purposes assume that each throw is independent and has the same distribution.

We don't know what the distribution is, but we have for practical purposes added enough constraints to be able to figure them out.

We know that there are only six possible choices per throw: no matter what can happen this has to be true. We assume independence of each throw or trial. This simplifies all the conditional statements about the model and makes it very manageable.

Now we get the data and we estimate the parameters based on this model. We look at the data and, not surprisingly, if we did a likelihood estimation procedure for the parameters given this data, we would conclude that under the constraints of the model the process that generated the data (i.e. the die) was not a balanced one (i.e. not one with all probabilities the same).

The assertions of the process are made on the grounds that each trial/throw is independent. The six possibilities per trial are definite since there really are only six possibilities per trial.

Would you use another set of constraints for this model? If so why?
 
  • #133
Hurkyl said:
And your analysis doesn't even look like the problem we were considering anyways. Did you start considering a very different problem?

For reference, the problem was essentially:

We are given two 20-long sequences of numbers. One of them is "real", generated by rolling a fair die. One of them is "fake", selected by our opponent. Our goal is to guess which sequence is real.​

It does! I'll post the specific problem that I am referring to. Here is a word-for-word quote from the original post:

In theory, the results are equally likely. Both specify the number that must appear each time the die is rolled. (For example, the 10th number in the first series must be a 1. The 10th number in the second series must be a 3.) Each number—1 through 6—has the same chance of landing faceup.

But let’s say you tossed a die out of my view and then said that the results were one of the above. Which series is more likely to be the one you threw? Because the roll has already occurred, the answer is (b). It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.

I'm referring to the bolded part. Marilyn is given data for a process which we assume has the properties of the die (hence my assumptions above) and she has to make up her mind whether the die is fair (all probabilities = 1/6) or not fair (they don't all equal 1/6).

Now again we can't assume that all probabilities = 1/6. We are given the constraints for a probability model (6 events per trial, all trials independent) and we have to take the data and estimate intervals for the parameters (i.e. 5 different probabilities since the 6th is the complement).

We can't just assume the data came from a fair die: we have to get the data and use that to estimate the parameters of a multinomial distribution.

The assumptions that lead to the constraints are based on some well accepted properties for these kinds of processes: coin flips, dice rolls and so on. I didn't just make this stuff up: it's based on independence of events and many people agree (including statisticians) that while it is not a perfect set of constraints, it suits its purpose rather well.

Now her terminology is not that accurate with regard to a 'mixed bunch of numbers', but you could formulate that mathematically and show that her argument holds a lot of water.

So again to conclude: Marilyn gets the data for a dice roll with each digit being 1, 2, 3, 4, 5 or 6. She gets a big string of 1's. She has to decide whether this data came from a fair die (all probabilities = 1/6) or a not so fair die (complement of this). Using some accepted properties of things like dice rolls (independence) she has a multinomial model for the data and needs to estimate its parameters. With all 1's, unsurprisingly, she rejects the hypothesis that the process that produced the data was something that would be defined as a fair die, and from that says what she said.
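A rough numeric sketch of that estimation, taking the data to be twenty 1's and comparing the fair-die hypothesis with the unconstrained multinomial MLE:

Code:
import math

counts = [20, 0, 0, 0, 0, 0]           # observed face counts for twenty rolls
n = sum(counts)

mle = [c / n for c in counts]          # unconstrained multinomial MLE: [1.0, 0, 0, 0, 0, 0]

loglik_fair = n * math.log(1 / 6)      # log P(data | fair die), about -35.8
loglik_mle = sum(c * math.log(p) for c, p in zip(counts, mle) if c > 0)   # 0.0 here

print(mle)
print(2 * (loglik_mle - loglik_fair))  # likelihood-ratio statistic ~ 71.7: fairness is soundly rejected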
 
  • #134
Hurkyl said:
Right. If you don't have any priors, you can't do statistical inference. You can gather and analyze data, and tabulate whatever evidence you can extract from the data, but you cannot use that evidence to infer whether some hypothesis is more likely than some other hypothesis.

You have to have prior probabilities if you want to do statistical inference -- even if it's just a blind assumption of uniform priors of some sort.

If you want to go into the Bayesian way of thinking, then assume the prior is flat. By doing this we don't inject any information that would otherwise give us an advantage in the parameter estimation of the multinomial distribution.

If you have a prior assumption about the data generation -- e.g. that it's generated by some parametrized process and you have flat priors on the parameters -- then we could try to estimate parameters. We could then take the parameter with the highest posterior probability and see what distribution that produces on the thing we're actually interested in...

but then we would be doing things wrong. When you string together ideas in an ad-hoc fashion, rather than in a way aimed at solving the problem you're actually trying to solve, you get poor results.

If we remember what we're actually trying to solve, we would know to factor in information from all parameters, and could do so directly without having to deal with parameter estimation as an intermediary:
[tex] P(A \mid O) \propto \sum_\theta P(A \wedge O \mid \theta) P(\theta)[/tex]
Where A is the hidden value we're trying to predict, O is the observation we saw, and [itex]\theta[/itex] is the parameter. The most likely value of A is the one that maximizes the sum on the right hand side.

(the constant of proportionality is the same for all A)

Incidentally, in the special case that, for each [itex]\theta[/itex], [itex]A[/itex] and [itex]O[/itex] are independent, this simplifies to
[tex] P(A \mid O) = \sum_\theta P(A \mid \theta) P(\theta \mid O)[/tex]
(equality, this time) One could interpret this as saying, in this special case, that we can get the probability of A given our observation by first using O to get posterior probabilities for [itex]\theta[/itex], and then remembering to incorporate information from all [itex]\theta[/itex], weighted appropriately.

I'm pretty sure I've addressed these issues indirectly but I'll comment briefly on this reply.

If we use the independence/multinomial assumption, a lot of this can be simplified dramatically. Again, the multinomial distribution assumption for a die is used because it is a lot more manageable than attempting to factor in all of the conditional behaviour that, while it may happen, is assumed not to matter much for the descriptive characteristics of the process. I'm not saying these things couldn't occur; it's just that the model is accepted to be a decent enough approximation, and this makes life easier.

I am aware of the differences between the Bayesian and classical approaches with respect to the effects of priors, and in this specific case (like when you have only one value appearing in your sample) you can get some weird things when you take the classical approach, but this is getting sidetracked.

If you want to take into account conditional dependence and you can't, for one reason or another, assume independence as you do in binomial or multinomial distributions, then your likelihood is going to get far messier than with these models. All I have done is fall back on these models because they are a well-accepted constraint, intuitive to understand and simple to make use of, that's all.
 
Last edited:
  • #135
Marilyn said:
But let’s say you tossed a die out of my view and then said that the results were one of the above. Which series is more likely to be the one you threw? Because the roll has already occurred, the answer is (b). It’s far more likely that the roll produced a mixed bunch of numbers than a series of 1’s.
chiro said:
I'm referring to the bolded part. Marilyn is given data for a process which we assume has the properties of the die (hence my assumptions above) and she has to make up her mind whether the die is fair (all probabilities = 1/6) or not fair (they don't all equal 1/6).
...
So again to conclude: Marilyn gets the data for a dice roll with each digit being 1,2,3,4,5 or 6. She gets a big string of 1's. She has to decide whether this data came from a fair die (all probabilities = 1/6) or a not so fair die (complement of this).
Did you notice you've significantly changed the problem? I get the impression you've fixated on one method of approaching the problem so strongly that you're having trouble acknowledging any other aspects of the situation.

I need you to understand the following five problems are different problems:
  1. Here are two sequences, one real, one fake. The real one is generated by a fair die roll. The fake one is generated by the person asking the question. Which one is real?
  2. Here are two sequences. Given the hypothesis that one of them was generated by rolling a fair die, which one is more likely to be the one rolled?
  3. Here are two sequences. Which one is more likely to be generated by rolling a fair die?
  4. Here are two histograms. Which one is more likely to be generated by rolling a fair die?
  5. Here is a sequence. Was it generated by a fair die roll?
  6. Here is a sequence generated by die roll. Is the die fair?
(I fibbed slightly -- problems #2 and #3 are pretty much the same problem)

The original problem was problem #2. Marilyn modified the problem to turn it into problem #1, and was criticized for confusing problem #1 with problem #4.

You, I think, are trying to solve problem #4 too, but you're solving it by pretending it is two instances of problem #5, but the work you're describing is for solving problem #6.

That last thing is one of the things I'm criticizing. People make very serious blunders by pretending like that. There's one situation I recall vividly: there was a gaming community that was trying to test whether some character attribute had any effect on the proportion of success. They gathered data that supported the hypothesis with well over 99% confidence... but they spent years believing there was no effect because some vocal analysts made a substitution similar to what you did:
We want to test if proportion 1 is bigger than proportion 2, right? Well, let's estimate the two proportions. (Compute two confidence intervals) The confidence intervals overlap, so the data isn't significant.​
Whereas if they had done a test that was actually designed to answer the question at hand (a difference between proportions test), they would have seen the result as very significant.

Problem #5 is of a typical philosophically interesting type, because we can't talk about the probability of the answer. We can't even give an answer of the sort "yes is more probable than no". We can, however, choose a strategy to answer the question such that if the true answer is "yes", then we will be correct over, e.g., 95% of the time.

But all of that aside, the main thing you're missing about problem #1 (and problem #6) is the thing that makes it very different from problems #2 through #5: we're not trying to answer questions about a single "process"; we have two different processes, and we're trying to decide which processes produced the outputs we have. True, it can be difficult to get precise or accurate information about one of the processes, but that doesn't change the form of the problem.

(#6 and #1 are different because #6 has a single output and we're trying to guess which among many processes generated that output, and #1 has two processes with two outputs, and we're trying to say which one goes with which)

All that aside, if we try to use your strategy to solve problem #1, you will have a low probability of success against many people: it is a well-known tendency for humans to generate fake data that is *too* uniform. For example, 66234441536125563152 is 1.5 standard deviations too uniform by the test I did. So, when you take the real and fake data, decide what bias is most likely on the die, and compare to fair, you will pick the overly uniform fake data over the randomly generated data most of the time.

Any question of the form <anything> versus 11111111111111111111 is very unlikely to ever come up except against a human opponent who is likely to make that sort of bluff, so your mis-analysis won't cost you much in this case. However, it will cost you big-time by picking the overly-uniform data too much.
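(One way to see a number like that: the ordinary chi-square goodness-of-fit statistic for that string, sketched below; this is an illustration, not necessarily the exact test used.)

Code:
from collections import Counter
import math

seq = "66234441536125563152"
counts = Counter(seq)
expected = len(seq) / 6                 # 20 rolls of a fair die: 10/3 per face

chi2 = sum((counts[str(face)] - expected) ** 2 / expected for face in range(1, 7))
z = (chi2 - 5) / math.sqrt(10)          # chi-square with 5 df has mean 5, variance 10
print(chi2, z)                          # chi2 = 0.4, z ~ -1.45: about 1.5 sd "too uniform"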
 
  • #136
I haven't read through the thread. But in short, she is right.

Any one valid string of dice rolls is just as probable as any other.

So what are people talking about for 9+ pages?
 
  • #137
Hurkyl said:
Did you notice you've significantly changed the problem? I get the impression you've fixated on one method of approaching the problem so strongly that you're having trouble acknowledging any other aspects of the situation.

If I did that it was completely intentional: like I said in the quote, I focused on what the quote said literally and I interpreted it to be what I said.

I already acknowledged that the other part of the question which has been addressed is fair: I agree with your stance on probabilities being equal and all the rest of that which has been discussed in depth.

Again, I'm not trying to hide anything: I just looked at the quote and interpreted it to mean what it meant in the way that I described.

I thought I made it clear when I was talking about parameter estimation, but I think that perhaps I should have been clearer. I'll keep that in mind for future conversations.

I need you to understand the following five problems are different problems:
  1. Here are two sequences, one real, one fake. The real one is generated by a fair die roll. The fake one is generated by the person asking the question. Which one is real?
  2. Here are two sequences. Given the hypothesis that one of them was generated by rolling a fair die, which one is more likely to be the one rolled?
  3. Here are two sequences. Which one is more likely to be generated by rolling a fair die?
  4. Here are two histograms. Which one is more likely to be generated by rolling a fair die?
  5. Here is a sequence. Was it generated by a fair die roll?
  6. Here is a sequence generated by die roll. Is the die fair?

For what I was talking about I was only concerned with the problems where a sequence was given. Again I thought I made that very clear. I am, as you have pointed out, addressing the last point in the list.

In terms of a sequence being generated by a non-die process (but one that still has the same probability space), we can't really know this based on Marilyn's circumstance: we have assumed that someone else rolled a die, and therefore we construct the constraints we construct. Does that seem like a fair thing to do? If not, why not?

You, I think, are trying to solve problem #4 too, but you're solving it by pretending it is two instances of problem #5, but the work you're describing is for solving problem #6.

I am specifically solving problem 6 yes, but I've outlined my reasoning above.

That last thing is one of the things I'm criticizing. People make very serious blunders by pretending like that. There's one situation I recall vividly: there was a gaming community that was trying to test whether some character attribute had any effect on the proportion of success. They gathered data that supported the hypothesis with well over 99% confidence... but they spent years believing there was no effect because some vocal analysts made a substitution similar to what you did:
We want to test if proportion 1 is bigger than proportion 2, right? Well, let's estimate the two proportions. (Compute two confidence intervals) The confidence intervals overlap, so the data isn't significant.​
Whereas if they had done a test that was actually designed to answer the question at hand (a difference between proportions test), they would have seen the result as very significant.

Yes, I have found that statistics and probability have a habit of getting people to fall into that trap, and even for people who have been doing this for a long time it can still happen. But with respect to the answer, I thought it was clear what I was saying.

Problem #5 is of a typical philosophically interesting type, because we can't talk about the probability of the answer. We can't even give an answer of the sort "yes is more probable than no". We can, however, choose a strategy to answer the question such that if the true answer is "yes", then we will be correct over, e.g., 95% of the time.

I agree with you on this, but again I wasn't focusing on this.

But all of that aside, the main thing you're missing about problem #1 (and problem #6) is the thing that makes it very different from problems #2 through #5: we're not trying to answer questions about a single "process"; we have two different processes, and we're trying to decide which processes produced the outputs we have. True, it can be difficult to get precise or accurate information about one of the processes, but that doesn't change the form of the problem.

(#6 and #1 are different because #6 has a single output and we're trying to guess which among many processes generated that output, and #1 has two processes with two outputs, and we're trying to say which one goes with which)

I never argued about that part of the problem. You might want to look at the response I had for those parts of Marilyn's statement. You made a statement about this and I agreed with you: again I'm not focusing on that part and I made it clear before what my thoughts were.

All that aside, if we try to use your strategy to solve problem #1, you will have a low probability of success against many people: it is a well-known tendency for humans to generate fake data that is *too* uniform. For example, 66234441536125563152 is 1.5 standard deviations too uniform by the test I did. So, when you take the real and fake data, decide what bias is most likely on the die, and compare to fair, you will pick the overly uniform fake data over the randomly generated data most of the time.

Any question of the form <anything> versus 11111111111111111111 is very unlikely to ever come up except against a human opponent who is likely to make that sort of bluff, so your mis-analysis won't cost you much in this case. However, it will cost you big-time by picking the overly-uniform data too much.

Again, I agree that if a process has specific characteristics then regardless of what we 'think' it doesn't change the process. I didn't argue that and in fact I agreed with you if you go back a few pages in the thread. The process is what the process is.

The big thing I have learned from this is that in a conversation like this (and especially one this heated) we need to all be clear what we are talking about. It includes me but I think it also includes the other participants as well.

I will make the effort on my part to do this for future threads, especially ones of this type.
 
  • #138
SidBala said:
I haven't read through the thread. But in short, she is right.

Any one valid string of dice rolls is just as probable as any other.

So what are people talking about for 9+ pages?

It's become a heated argument, with a little bit of misunderstanding about what other posters are specifically talking about thrown in for good measure :)
 
  • #140
She seems a little too certain for someone who had to backpedal from her claims on the proof of Fermat's last theorem being flawed.
 
