# Homework Help: A statistics assignment (Binomial distribution)

1. Nov 16, 2013

### anonymousk

I've translated this assignment from another language, so I hope the translation is understandable.

1. The problem statement, all variables and given/known data

According to Statistics Denmark, the construction sector had an average of 920 business bankruptcies per year in the period 2009-2011, out of a total population of 32.100 companies. These figures are considered to be representative of the construction sector (hereinafter abbreviated C-sector) for the period from 2009 onwards. The number of businesses in the C-sector is specifically assumed to be constant from 2009 onwards.

a)
Explain the conditions under which the number of business bankruptcies in the C-sector during 2014 can be assumed to be binomially distributed.
Discuss the extent to which these assumptions are likely to be met in practice.

The number of business bankruptcies in the C-sector during 2014, XC, is hereinafter assumed to be binomially distributed with parameters nC and PC.

b)
Specify an estimate for PC.
Specify an estimate for the expected number of bankruptcies in the C-sector during 2014.
Calculate the variance of the estimator of PC.
Calculate the variance of the estimator of the expected number of bankruptcies in the C-sector during 2014.

c)
Explain that the number of bankruptcies in the C-sector during 2014 can be assumed to be approximately normally distributed.
Specify the parameters of the approximate normal distribution.

d)
Specify an approximate 95% confidence interval for PC.
Specify an approximate 95% confidence interval for the expected number of bankruptcies in the C-sector during 2014.

e)
Explain why, in question 1c), we can conclude that the number of business bankruptcies in the C-sector during 2014 is approximately normally distributed when we already know that the number is in fact (exactly) binomially distributed.

2. Relevant equations

(I apologize, but I simply cannot find the "sigma symbol in the toolbar" for writing complex equations. They are written like this not for lack of effort: I spent a good 5 minutes just looking for that thing! I hope they are understandable though.)

P(-Zα/2 < (Y - θ)/√Var(Y) < Zα/2) = 1 - α

]Y - Zα/2*√Var(Y) ; Y + Zα/2*√Var(Y)[
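
As a rough illustration of how the interval above could be evaluated, here is a minimal Python sketch. The function name `approx_ci` and the example numbers are assumptions made for illustration only; `y` stands for whichever estimator is being used and `var_y` for its (estimated) variance.

```python
from math import sqrt
from statistics import NormalDist


def approx_ci(y, var_y, level=0.95):
    """Return (lower, upper) for y -/+ z_{alpha/2} * sqrt(var_y)."""
    alpha = 1 - level
    # Upper alpha/2 quantile of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(var_y)
    return y - half_width, y + half_width


# Illustrative values only: an estimate of 0.5 with variance 0.01.
lower, upper = approx_ci(0.5, 0.01)
print(f"]{lower:.3f} ; {upper:.3f}[")
```

The same function can be reused for questions about a probability or an expected count, as long as the matching variance is passed in.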

3. The attempt at a solution

I'm not really sure about a). I would say the number is binomially distributed because each business either goes bankrupt or it doesn't. There are 2 outcomes, where success in this case is bankruptcy, and both outcomes are independent of each other, hence they are binomially distributed. Is this right?

I have no clue how to discuss the extent to which these assumptions are likely to be met in practice, hmpf.

b)

The estimate for Pc is P-hat = X/n = 920/32100 = 0,02866

The estimate for the expected bankruptcies is given in the text itself I believe? Which is X=920. So that's P-hatX = 920

The variance for the estimate PC is Var(PC)=nP(1-P) = 32100*0,02866(1-0,02866) = 29,89

The variance for the estimate of the expected value is? This is where it goes wrong for me .. Surely I can't do the nP(1-P) again .. 32100*920(1-920) .. it will give me a negative number.. So where did I mess up?

Thanks in advance. And if part of the assignment text isn't understandable/poorly written, tell me, and I can try and re-translate it.

Last edited: Nov 17, 2013
2. Nov 16, 2013

### haruspex

The two outcomes for one business are not independent - they are mutually exclusive!
If every business had the same probability of bankruptcy, and independently of the others, then it would certainly be binomial. The outcome for each business would be like a coin toss (with biased coin). The question is whether you can relax either of those constraints. Will it still be binomial if they have different probabilities but are independent? What if they're not independent?
It was 321000.
Careful - how many data values was PC calculated from?
920 is not a probability, so it makes no sense as the p in p(1-p).
What is the difference from the preceding question?
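
One way to probe the question about differing probabilities is to compare variances. The sketch below assumes a made-up population (the probabilities and group sizes are hypothetical, not from the assignment). For independent businesses with unequal probabilities, the count is in general no longer binomial, and its variance is never larger than that of a binomial with the same number of businesses and the same average probability.

```python
def binomial_variance(n, p):
    """Variance of a Binomial(n, p) count."""
    return n * p * (1 - p)


def independent_sum_variance(probs):
    """Variance of a sum of independent 0/1 outcomes with the given probabilities."""
    return sum(p * (1 - p) for p in probs)


# Hypothetical population: 50 low-risk and 50 high-risk businesses.
probs = [0.01] * 50 + [0.05] * 50
mean_p = sum(probs) / len(probs)

var_mixed = independent_sum_variance(probs)
var_binomial = binomial_variance(len(probs), mean_p)
print(var_mixed, var_binomial)
```

With equal probabilities the two functions agree, which is the binomial case described above.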

3. Nov 17, 2013

### anonymousk

Ah yes, they are mutually exclusive = both events cannot occur at the same time.

Hmm, the probability of success has to stay the same for all trials to be a binomial distribution, so no.

I believe the trials have to be independent for a binomial distribution as well, so also no?

Ah yeah, it was a typo. 920/32100 gives 0,02866, which is the estimate for P (P-hat)

Hmm, I don't think I'm really following. But it seems I might have used the wrong formula. I just found Var(P-Hat) = P(1-P)/n = 0,02866(1-0,02866)/32100 = 0,000000867

Hmm.. Now we're working with an expected value E(X)? Not sure where to go from here either.

4. Nov 17, 2013

### haruspex

PC was computed as an average over three years of data. That will reduce the uncertainty.

5. Nov 17, 2013

### anonymousk

Reduce what uncertainty? I'm really lost here, and am really bad at statistics, but would really like to learn though.

6. Nov 17, 2013

### haruspex

Reconstruct the actual data as far as you can from that. How many bankruptcies in total over those years? Over how many companies (ignoring the fact that most of the companies are actually the same ones in successive years - treat them as an entirely fresh batch)? What does that give you for the var of the total bankruptcies over three years? Bearing in mind that var is the expected value of the square of the difference from the mean, what var do you get when you scale that back down to a single year (one third of the population)?
The point here is that variance is a measure of uncertainty in the estimated value for the mean. If you calculate the mean from a larger sample (three years of data instead of one here) that must reduce the var.

7. Nov 17, 2013

### anonymousk

So that's 2760 bankruptcies total over 3 years over a total of 32100 companies.
2760/32100 = 0,08598

I'm not sure if I should use Var=(P(1-P)/n) or Var=nP(1-P)
Mind explaining why/when to use one or the other?

I'll use np(1-p) = 32100*0,08598(1-0,08598)=2522,6567 is the variance for total bankruptcies over 3 years.

Var=(X-μ)², hmm.. I'm really lost. I feel like I can't grasp this assignment.

Edit: I'm gonna start over.. So the estimate for PC I got right (0,02866) .. And the estimate for the expected number of bankruptcies during 2014.. Isn't that given in the text itself?
"The number of business bankruptcies in the C-sector during 2014, XC"
E(X)=920

8. Nov 17, 2013

### haruspex

No, the population is now 3 * 32100. As I said, you have to pretend the companies are independent from year to year.
No, the prob should be the same as before, but the average (now being a total over 3 years) should be three times larger, and so should the var.
Call the var of the 3 year total $V_3$. If the actual annual average is $\mu$ and the estimated average is $\hat{\mu}$ then $E((3\mu-3\hat\mu)^2) = V_3$. So what is $E((\mu-\hat\mu)^2)$?

9. Nov 17, 2013

### anonymousk

I see.
Annual average=920
If PC was computed as an average over three years of data, is PC then my $\hat\mu$?

I can't seem to understand how to define my $\hat\mu$.

$E((920-0,02866)^2)$ and $E((920-2760)^2)$ both give numbers that seem very wrong, hah. I'm waiting to get the "A-HAH" moment here.

10. Nov 18, 2013

### anonymousk

Okay, I've now slept on it.

Is the estimate the same in both cases?
Estimate for PC being (920*3)/(32100*3) = 0,02866
And estimate for the expected number of bankruptcies during 2014 being 920/32100= 0,02866, but it's the variance where they differ.

The variance for PC is 96300*0,02866(1-0,02866) = 2680,86
The variance for the expected value is 32100*0,02866(1-0,02866) = 893,62

$E((\mu-\hat\mu)^2)$ is the variance for the estimate of the expected number bankruptcies during 2014. But I'm unable to continue from there.

11. Nov 18, 2013

### haruspex

There are Y years with N businesses in each year. Because we are treating the years as independent, the Y years of data are just like one year of data with Y*N businesses. Y*B bankruptcies are observed. (I write it that way because we are told B, not Y*B.)
Let the actual prob of a given business going bankrupt in a given year be p. Our estimate of the probability is $\hat p = (Y B)/(Y N) = B/N$. (Writing a hat over an unknown parameter is a standard way of representing our estimate of the parameter.)
The observed number for the Y years is Y*B, so this is $\hat \mu_Y$, our estimate of the average number of bankruptcies in Y years.
Correspondingly, our estimate for the number in one year is $\hat\mu = \hat \mu_Y/Y = B$.
We know that the variance in $\hat \mu_Y$ (the mean square error) is $V_Y = E((\hat \mu_Y - \mu_Y)^2) = Y N p(1-p)$, and we estimate that as $Y N \hat p(1-\hat p)$.
So now we want to know $V = E((\hat \mu - \mu)^2)$.
What is the relationship between $\mu$ and $\mu_Y$? What do you get for V by combining those equations?

12. Nov 20, 2013

### anonymousk

I've been looking at this for 2 days now, and simply cannot see what you're trying to lead me towards here (combining those equations).

However, I looked into Standard Error of the Mean

What I'm sitting with at the moment is

Variance of the estimate PC:
σ/√n = 29,9/√3 = 17,26 = std dev
σ² = 17,26² = 297,91.

And the variance for the expected number of bankruptcies:
0,00093/√3 = 0,00054
σ² = 0,00054² = 0,00000029

I put N=3 as sample size since it's the average of 3 years, (2009 2010 2011).

If it's correct I have no idea.

Last edited: Nov 20, 2013
13. Nov 20, 2013

### haruspex

Yes, you have to divide by √3. I was trying to avoid appeal to the authority of such formulae because I felt it would not give you any further insight into what happens. So FWIW I'll explain the path I was on:
$\hat p = (Y B)/(Y N) = B/N$.
Estimate of the average number of bankruptcies in Y years, Y*B = $\hat \mu_Y$.
Estimate for the number in one year is $\hat\mu = \hat \mu_Y/Y = B$.
$V_Y = E((\hat \mu_Y - \mu_Y)^2) = Y N p(1-p)$,
which we estimate as $Y N \hat p(1-\hat p)$.
We want to know $V = E((\hat \mu - \mu)^2)$.
$\mu = \mu_Y/Y$, $\hat\mu = \hat\mu_Y/Y$
So
$V = E((\hat \mu_Y/Y - \mu_Y/Y)^2) = E((\hat \mu_Y - \mu_Y)^2)/Y^2 = Y N p(1-p)/Y^2 = N p(1-p)/Y$
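
To make the scaling step concrete, the derivation can be checked numerically with the figures from the assignment. This is only a sketch: the variable names are chosen here for illustration, and $\hat p$ is plugged in for the unknown $p$.

```python
from math import sqrt

Y = 3        # years of data (2009-2011)
N = 32100    # businesses per year
B = 920      # average bankruptcies per year

p_hat = B / N

# Estimated variance of the Y-year total, Y*N*p(1-p) with p_hat in place of p.
V_Y = Y * N * p_hat * (1 - p_hat)

# Dividing the total by Y divides the variance by Y**2.
V = V_Y / Y**2

print(f"V_Y = {V_Y:.2f}, V = {V:.2f}, sd = {sqrt(V):.2f}")
```

The result for `V` is roughly 297.9, the same as $N \hat p(1-\hat p)/Y$, with a standard deviation of about 17.3.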

But you also seem to be a bit confused between number of bankruptcies and the probability of a bankruptcy.
In the OP, PC is the probability of a bankruptcy (for a given business in a given year). I'm just going to write that as p for convenience. To recap:
- p is the actual probability of an individual bankruptcy
- if we observe B*Y bankruptcies over N*Y business-years then our estimate for p is $\hat p = B/N$
- $\hat p$ will not in general be equal to p; it is only our estimate for it based on the observations. The mean square error (variance) in the estimate is p(1-p)/(number of observations) = p(1-p)/(NY).
- That doesn't quite get us there because we don't know p, we only know $\hat p$. It turns out that an unbiased estimate of the variance is $\hat p(1-\hat p)/(NY-1)$, but given that NY is quite large here the -1 won't make much difference.
- Having estimated p, we can calculate an expected number of bankruptcies, $E_n$, in some other year. $E_n = N\hat p$.
- It remains to find the variance of $E_n$. If we knew p we could write it down as Np(1-p). But we know $\hat p$, not p. So what do you think the variance of $E_n$ is?
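
The first few points of the recap can be evaluated directly. The following is a minimal sketch using the assignment's figures; the variable names are illustrative, and it stops short of the open question in the last bullet.

```python
Y = 3        # years of data
N = 32100    # businesses per year
B = 920      # average bankruptcies per year

n_obs = N * Y                 # business-years observed
p_hat = (B * Y) / n_obs       # same as B / N

# Variance of p_hat, with p_hat plugged in for the unknown p.
var_p_hat = p_hat * (1 - p_hat) / n_obs

# Unbiased version: divide by (n_obs - 1) instead of n_obs.
var_p_hat_unbiased = p_hat * (1 - p_hat) / (n_obs - 1)

# Expected number of bankruptcies in another year.
E_n = N * p_hat

print(p_hat, var_p_hat, var_p_hat_unbiased, E_n)
```

With `n_obs` close to 100 000, the two variance estimates differ only by about one part in 100 000, which is why the -1 can be ignored here.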

14. Nov 21, 2013

### anonymousk

I appreciate that. It's quite frustrating using these formulas without knowing what's happening behind it all. I'm still struggling to understand all this, but I feel like I'm getting there (slowly).

Just to be clear: Y in this case will be the 3 years the data is calculated from? (2009, 2010 and 2011). So it can be a number from 1-3.

Just to be clear/sure again: The assignment only asks for the estimate of PC, right? The P-hat = b/n. Finding the real probability of P is just extra you did over what is originally asked for?
And same with "an estimate for the expected number of bankruptcies in the C-sector during 2014.". The estimate number in 2014 would be 920.
Since $\hat p$ is all we know, my best guess is $N\hat p(1-\hat p)$, but I'm not sure where to go from there.

15. Nov 21, 2013

### haruspex

You are only given the aggregate of the data for three years, so Y=3, not any number from 1 to 3.
There is no way to find the 'real' probability (if there can be said to be such a thing). You can express various things in terms of this theoretical p, but we can't use that to find the value of anything. We have to rely on $\hat p$.
Yes.
A confession: I left that hanging as a question to play for time. I needed to think a bit more myself.
There are two sources of uncertainty (variance) in what the actual number will be in 2014:
- The uncertainty arising from our uncertainty in how well the value $\hat p$ approximates p.
- Even if we knew p, random variation comes into play.
Your expression $N\hat p(1-\hat p)$ correctly gives (an estimate for) the second of those. I believe that the two sources of error are independent, so we can simply add the variances they give rise to. What do you think the variance is in the prediction for 2014 that arises from uncertainty in $\hat p$?

16. Nov 23, 2013

### anonymousk

Hmm.. I don't know. I don't understand why the variance for PC is a higher number than the variance for the expected number of bankruptcies during 2014 either.

Edit:

I'm also unsure of the numbers I calculated now.

The two variances I found

297,91 for PC and the variance for the expected number of bankruptcies 0,00000029. Shouldn't it be the other way around??

And I found 0,00000029 by variance of mean/n². Maybe I've been sitting with this assignment for too long and got really confused by it all, but was this the right way to calculate the variance for the expected number of bankruptcies during 2014?

Last edited: Nov 23, 2013
17. Nov 23, 2013

### haruspex

As I wrote, you seemed to be getting confused between the probability and the expected number of bankruptcies.
The estimate for p (i.e. PC) is $\hat p = 920/32100 = 0,029$.
The variance in $\hat p$ is p(1-p)/(NY) = p(1-p)/(3*32100), which we estimate as $\hat p(1-\hat p)/(3*32100)$. See my 'recap' post.
If you are still getting different results for those by your methods, please post your current working and I will endeavour to spot where you are going wrong.

If we knew p exactly, the variance for the expected number bankruptcies in 2014 would be Np(1-p), which we would estimate as $N\hat p(1-\hat p)$. But we don't know p exactly, so we have to add another variance component to account for that. I have left it as an exercise for you to suggest what that extra variance should be.

18. Nov 23, 2013

### anonymousk

$\widehat{P}$ I've got right. 0,02866

Estimate for the expected number of bankruptcies is the 3 year average, $\overline{X}$=920

32100*0,02866(1-0,02866) = 893,6192, and 893,6192/3 = 297,87

This is where I'm confused.

I divided 893,6192 by 32100² to get 0,00000087, and 0,00000087/3 = 0,00000029 (which is the variance of the expected number of bankruptcies for 2014).

Hmm, I think what you are referring to is dividing by 3, because of the interval we get our numbers from.

19. Nov 23, 2013

### haruspex

No. I don't understand why you keep trying to apply that formula to get the variance in the estimator of the probability. That formula, Np(1-p), gives the variance in the observed number of bankruptcies if there are N in the sample and the probability is p. I gave you the correct formula in the previous post and elsewhere.
That's obviously much too low. I don't understand that calculation. This is where you use the Np(1-p) formula. As I posted earlier today:
We have computed the variance in our estimator for p. If we knew p exactly we would estimate the number of bankruptcies in 2014 as Np. If our value for p is wrong by an amount x, how much would that displace our estimate for the number of bankruptcies in 2014?
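
One way to read the hint above, offered as a hedged sketch rather than the thread's settled answer: an error of x in $\hat p$ shifts the prediction $N\hat p$ by $Nx$, so that source contributes $N^2$ times the variance of $\hat p$. Assuming, as suggested earlier in the thread, that this is independent of the year-to-year random variation, the two components add. Variable names are illustrative.

```python
Y = 3        # years of data
N = 32100    # businesses per year
B = 920      # average bankruptcies per year

p_hat = B / N
var_p_hat = p_hat * (1 - p_hat) / (N * Y)

# Component 1: uncertainty in p_hat, scaled by N because the estimate is N * p_hat.
var_from_p_hat = N**2 * var_p_hat

# Component 2: random variation in a single year, estimated with p_hat.
var_random = N * p_hat * (1 - p_hat)

# Sum of the two, under the independence assumption.
var_total = var_from_p_hat + var_random

print(f"{var_from_p_hat:.2f} + {var_random:.2f} = {var_total:.2f}")
```

The first component comes out at roughly 297.9 and the second at roughly 893.6, so under these assumptions most of the uncertainty in the 2014 count comes from ordinary random variation rather than from the estimate of p.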

20. Nov 23, 2013

### anonymousk

0,02866(1-0,02866)/(3*32100) = 0,000000289
This is the variance of the estimator of P. Again a really low number. I'm thinking of giving up on this assignment, because it simply isn't clicking.

32100*0,02866(1-0,02866)/3 = 297.. This number makes sense .. so the standard deviation from the mean for a single year would be 17,25.

Hmm, I'm thinking it would displace our number by 'variance of estimator for p'.