There’s a google interview question that’s been discussed quite a bit in various forums recently. The question is: There is a certain country where everybody has to have a son. Therefore each couple keeps having children until they have a boy, then they stop. In expectation, what fraction of births are female? Apparently the “official google answer” is 0.5 due to the fact that the expectation of each birth being a girl (or boy) is always 0.5. The “stopping rule” does not change this expectation. The controversy is that some say that the official google answer and argument are wrong. Any comments?
Well, they ask "What fraction of births are female?" But a birth is always 50/50, so half the births are female. However, if you would ask "What fraction of the population is female", that's a more difficult question. But I guess that the answer would be 0.5 to...
I assume you mean the recent imbroglio created by Steven Landsburg, including his $15,000 bet (http://www.thebigquestions.com/2010/12/27/win-landsburgs-money/). Two comments: 1. Landsburg is wrong. 2. He won't lose any money. There are essentially two ways to set up this simulation. The simulation could be set up to answer the question as raised, in which case the outcome will be 50-50. Landsburg will not agree to such a set up. The simulation could also be set up to produce a biased estimator, in which case the other party should not agree to the setup. It isn't answering the question as raised.
Micromass- yes "what fraction of the pollution are female is how it phrased". It was changed to female births to reflect the fact that most discussions of this problem, seem to ignore the mothers- who of course, are female. DH- Why is Landsburg wrong? Can you put your finger on which part of his logic is faulty? Is it the use of the family as a model for a country?
I don't know what his problem is (or rather, I don't know why he is being so persistent). I do see what his problem is. He has been told in multiple places, in multiple ways, and in no uncertain terms that he is wrong. His problem is far even more basic than a biased estimator. Hhe is doing statistics incorrectly. His 44% figure arises from a high school, maybe even grade school, algebra mistake. Suppose you drive 10 miles at 10 miles per hour and another 10 miles at 20 miles per hour. What is your average velocity? By his logic, the answer would be 15 miles per hour.
He's calculating the wrong thing. The average fraction of girls per family is not the same as the fraction of girls in the overall population. Consider the following case: there are 10 families in town. 9 have one girl and no boys, and 1 has 9 boys and no girls. In this case, the fraction of girls in the population is 50%: 9/(9+9). However, the average fraction of girls per family is 90%.
Given that each couple have completed their job of getting one boy and stopping at that moment, the possibilities for each couple are B, GB,GGB,GGGB,... and so on, where B is boy and G is girl. The probability for each outcome is respectively 1/2,1/4,1/8, hence the expectation for fraction of boys in some family is 1/2*1+1/4*1/2+1/8*1/3+... = [itex]\sum_{k=1}^{\infty} \frac{1}{2^k k}[/itex], and the expectation for the fraction of girls in some family is the complement of this. Define [itex]f(x) = \sum_{n=1}^{\infty} \frac{x^n}{2^n n}[/itex]. We have [itex]f^{\prime}(x) = \sum_{n=1}^{\infty} \frac{x^{n-1}}{2^n} = \frac{1}{2}\sum_{n=0}^{\infty} (\frac{x}{2})^n = \frac{1}{2-x}[/itex], hence [itex]f(x) = C-\log(2-x)[/itex]. Since f(0) = 0, we have [itex]C = \log 2[/itex], giving "fraction of boys" = [itex]f(1) = \log 2 - \log 1 = \log 2[/itex]. Thus the fraction of girls is [itex]1-\log 2[/itex]. I am having second thoughts on this, it is confusing. I am almost certain my answer above is wrong, but I don't really see why.
You are doing exactly what Vanadium talked about in post #6. Landsburg did something different, but also incorrect.
Right, so 1-log(2) is the expected fraction of girls in some given family, not corresponding to the overall fraction. This is the algebraic mistake a/b + c/d = (a+c)/(b+d). I believe this is an illustration of Simpson's paradox.
I have never heard of Lansburg before this thread, but I consider the proposition to be an inappropriate application of probability theory. Life experience, that is measurement, of the birth rate suggests that P_{girl} = P_{boy} = 0.5 We should not apply the statistics of perfect unbiased coin tossing to a situation where we can never prove the event (birth) to be unbiased. We can define or even manufacture unbiased coins to any required degree of acuracy, but we cannot manufacture unbiased mothers. We cannot even show that birth events are not linked; birth statistics suggest otherwise, so while the overall probability approaches 0.5 girls or boys appear to run in families.
I was thinking this same thing: Look at at those girls some families have! But looking more closely, with this model, every family has exactly one Boy - even if they first have 12 Girls - or whatever number. 1/2 of the families have 0 Girls. → B 1/4 of the families have exactly 1 Girl. → BG 1/8 of the families have exactly 2 Girls. → BGG 1/16 of the families have exactly 3 Girls. → BGGG ... 1/2^{n} of the families have exactly (n-1) Girls. → BGG...G If we have 4096 families (A round number for computer scientists), 4096 Boys will be born. The number of girls will be: 1024 +2×512 +3×256 +4×128 +5×64 +6×32 +7×16 +8×8 +9×4 +10×2 +11×1 +12×(1/2) +13×(1/4) +... (Well, that's getting to be ridiculous.) = 1024 + 1024 + 768 + 512 + 320 + 192 + 112 + 64 + 36 + 20 + 11 + the terms with fractions of a family. = 4083 Girls in 2047 of the 2048 families with girls. The remaining family has at least 11 Girls with .5 probability of having (exactly) 1 more, .25 prob. of 2 more, ... That gives us at least 4094 Girls and 4096 Boys. Maybe we should take a Boy from that last family, since they're not finished having Girls! At any rate, the number of each is essentially the same. If I had started with 1 Giga-Family we would be left with only that discrepancy of one or two.
At some level, this problem reduces to "what is the name of the bus driver's mother?" The answer is given in the question: a 50% chance of giving birth to a boy means that in a sample of N births, as N gets large, the number of boys approaches N/2. I mention in passing that if it were possible to change this expectation by a stopping algorithm, that same algorithm could be used with lottery tickets.
Here is the response I just posted on Landsburg's blog. Seems to make sense. I apologize in advance for the crappy notation.
I think you can formulate an equivalent problem in the following way: You're given a binary string, x, of length N bits. Each bit in the string is chosen randomly from {0,1}, except the last bit of the string, whose value is always 1. Let z be the number of zeros occurring in x. What is the expectation value of z/N. The population problem can be cast in terms of the binary string problem by representing the birth sequence for each family as a binary string (so that for example GGB = 001, GGGB = 0001, etc) then concatenating the binary strings for each family into one big long string.
There is an easy way to look at the problem: "births are always 50/50". The policy does nothing done to change that. Births are still 50/50, so the population will still be 50/50. If you compute a result that says otherwise says you need to double-check your assumptions and your calculations.
This isn't the problem since this would allow 0,0,0,0,1,0,0,1 which isn't allowed in the orignal question. The only "1" in your string has to be the terminal one. If you are trying to think of how to model this with a computer (which is what your langauge seems to suggest that you’re thinking of) it’s really simple. Let r be your random variable, let b,g,t count the number of boys girls and total respectivly. Let i be a dummy variable and p be your population size. Initialize everything to 0. In C it would look something like this if you very litterally put it in code (my syntax is off I’m sure, I haven’t programed in a few years). 1. do while i < p; 2. { 3. r = rnd(0,1); 4. Do while r=0; 5. { g++; 6. t++; 7. r = rnd(0,1); } 8. b++; 9. t++; 10.i++} think of adding one to i as starting a new family. But this is obviously equivlent to: 1. do while i < p; 2. {i++ 3. r = rnd(0,1); 4. if r=0 then g++, t++; go to line 1; 5. if r=1 then b++, t++ got to line 1; 6. } Which trivially gives a 1:1 ratio statistically.
0,0,0,0,1,0,0,1 is perfectly valid. This would correspond to sampling 2 families, the first one of which had 4 girls and one boy, and the second of which had 2 girls and 1 boy. Edit: I wasn't thinking of how to calculate this computationally, which is a fairly easy task. I'm just trying to cut away the semantics of the problem. It's easy to see that the number of zeros in a random binary string will average to half the length of the string.
well, the bias is not actually 50/50, as it is usually 105 boys to 100 girls; therefore, it should be 51:50 or 0.51219...repeated to infinity. Tell me if I am wrong.