A statistics assignment (Binomial distribution)

anonymousk · Nov 16, 2013

I've translated this assignment from another language, so I hope the translation is understandable.

Homework Statement

According to the Statistics of Denmark, there was in the construction sector in the period 2009-2011 an average of 920 business bankruptcies per year out of a total population of 32.100 companies. These figures are considered to be representative of the construction sector (hereinafter abbreviated C-sector) for the period from 2009 and onwards. The number of businesses in the C-sector is specifically assumed to be constant from the year 2009 and onwards.

a)
Explain the conditions under which the number of business bankruptcies in the C-sector during 2014 can be assumed to be binomially distributed.
Discuss the extent to which these assumptions are likely to be met in practice.

The number of business bankruptcies in the C-sector during 2014, X_C, is hereinafter assumed to be binomially distributed with parameters n_C and P_C.

b)
Specify an estimate for P_C.
Specify an estimate for the expected number of bankruptcies in the C-sector during 2014.
Calculate the variance of the estimator of P_C.
Calculate the variance of the estimator of the expected number of bankruptcies in the C-sector during 2014.

c)
Explain that the number of bankruptcies in the C-sector during 2014 can be assumed to be approximately normally distributed.
Specify the parameters of the approximate normal distribution.

d)
Specify an approximate 95% confidence interval for P_C.
Specify an approximate 95% confidence interval for the expected number of bankruptcies in the C-sector during 2014.

e)
Explain why, in question 1c), we can conclude that the number of business bankruptcies in the C-sector during 2014 is approximately normally distributed, when we already know that the number is in fact (exactly) binomially distributed.

Homework Equations

(I apologize, but I simply cannot find the "sigma symbol in the toolbar" for making complex equations. Reason they are written like this is not a lack of effort, but I've spent a good 5 minutes only looking for that thing! Hope they are understandable though)

P(-Z_α/2 < (Y - θ)/√Var(Y) < Z_α/2) = 1 - α

]Y - Z_α/2*√Var(Y) ; Y + Z_α/2*√Var(Y)[

The Attempt at a Solution

I'm not really sure about a). I can say they are binomially distributed because a business either goes bankrupt or it doesn't. There are 2 outcomes, where success in this case is bankruptcy, and both outcomes are independent of each other, hence they are binomially distributed. Is this right?

no clue how to discuss the extent to which these assumptions are likely to be met in practice, hmpf.

b)

The estimate for P_c is P-hat = X/n = 920/32100 = 0,02866

The estimate for the expected bankruptcies is given in the text itself I believe? Which is X=920. So that's P-hat_X = 920

The variance for the estimate P_C is Var(P_C)=nP(1-P) = 32100*0,02866(1-0,02866) = 29,89

The variance for the estimate of the expected value is? This is where it goes wrong for me .. Surely I can't do the nP(1-P) again .. 32100*920(1-920) .. it will give me a minus number.. So where did I mess up?

Thanks in advance. And if part of the assignment text isn't understandable/poorly written, tell me, and I can try and re-translate it.

haruspex · Nov 16, 2013

anonymousk said:

I'm not really sure about a). I can say they are binomially distributed because a business either goes bankrupt or it doesn't. There are 2 outcomes, where success in this case is bankruptcy, and both outcomes are independent of each other, hence they are binomially distributed. Is this right?

The two outcomes for one business are not independent - they are mutually exclusive!
If every business had the same probability of bankruptcy, and independently of the others, then it would certainly be binomial. The outcome for each business would be like a coin toss (with biased coin). The question is whether you can relax either of those constraints. Will it still be binomial if they have different probabilities but are independent? What if they're not independent?
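
The unequal-probabilities question can be probed numerically. Below is a minimal sketch, assuming made-up values (1000 businesses, probabilities 0.01, 0.03 and 0.05) that are not from the assignment. For independent trials the variance of the count is the sum of p_i(1 - p_i), so we can compare a group where everyone shares p = 0.03 with a group that has the same average but mixed probabilities.

```python
# Variance of the number of successes in independent Bernoulli trials
# is the sum of p_i * (1 - p_i). All numbers below are hypothetical.

def count_variance(probs):
    """Variance of the success count for independent trials."""
    return sum(p * (1 - p) for p in probs)

n = 1000
equal = [0.03] * n                               # same p for every business
mixed = [0.01] * (n // 2) + [0.05] * (n // 2)    # same average p, unequal p_i

mean_equal = sum(equal)
mean_mixed = sum(mixed)
var_equal = count_variance(equal)   # matches the binomial n * p * (1 - p)
var_mixed = count_variance(mixed)   # smaller than the binomial value

print(mean_equal, mean_mixed)
print(var_equal, var_mixed)
```

The two means agree but the variances do not, so with unequal probabilities the count no longer has the binomial variance for the average p, even though the trials are still independent.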

b)

The estimate for P_c is P-hat = X/n = 920/31000 = 0,02866

It was 32100.

The variance for the estimate P_C is Var(P_C)=nP(1-P) = 32100*0,02866(1-0,02866) = 29,89

Careful - how many data values was P_C calculated from?

The variance for the estimate of the expected value is? This is where it goes wrong for me .. Surely I can't do the nP(1-P) again .. 32100*920(1-920) .. it will give me a minus number.. So where did I mess up?

920 is not a probability, so it makes no sense as the p in p(1-p).
What is the difference from the preceding question?

anonymousk · Nov 17, 2013

haruspex said:

the two outcomes for one business are not independent - they are mutually exclusive!
If every business had the same probability of bankruptcy, and independently of the others, then it would certainly be binomial. The outcome for each business would be like a coin toss (with biased coin).

Ah yes, they are mutually exclusive = both events cannot occur at the same time.

The question is whether you can relax either of those constraints. Will it still be binomial if they have different probabilities but are independent?

Hmm, the probability of success has to stay the same for all trials to be a binomial distribution, so no.

What if they're not independent?

I believe the trials also have to be independent for it to be a binomial distribution, so also no?

It was 32100.

Ah yeah, it was a typo. 920/32100 gives 0,02866, which is the estimate for P (P-hat)

Careful - how many data values was p_c calculated from?

Hmm, I don't think I'm really following. But it seems I might have used the wrong formula. I just found Var(P-Hat) = (P(1-P)/n) = (0,02866(1-0,02866)/920) = 0,000000867

920 is not a probability, so it makes no sense as the p in p(1-p).
What is the difference from the preceding question?

Hmm.. Now we're working with an expected value E(X)? Not sure where to go from here either.

haruspex · Nov 17, 2013

anonymousk said:

Hmm, I don't think I'm really following. But it seems I might have used the wrong formula. I just found Var(P-Hat) = (P(1-P)/n) = (0,02866(1-0,02866)/920) = 0,000000867

P_C was computed as an average over three years of data. That will reduce the uncertainty.

anonymousk · Nov 17, 2013

haruspex said:

P_C was computed as an average over three years of data. That will reduce the uncertainty.

Reduce what uncertainty? I'm really lost here, and am really bad at statistics, but would really like to learn though.

haruspex · Nov 17, 2013

anonymousk said:

Reduce what uncertainty? I'm really lost here, and am really bad at statistics, but would really like to learn though.

in the period 2009-2011 an average of 920 business bankruptcies per year out of a total population of 32.100 companies

Reconstruct the actual data as far as you can from that. How many bankruptcies in total over those years? Over how many companies (ignoring the fact that most of the companies are actually the same ones in successive years - treat them as an entirely fresh batch)? What does that give you for the var of the total bankruptcies over three years? Bearing in mind that var is the expected value of the square of the difference from the mean, what var do you get when you scale that back down to a single year (one third of the population)?
The point here is that variance is a measure of uncertainty in the estimated value for the mean. If you calculate the mean from a larger sample (three years of data instead of one here) that must reduce the var.

anonymousk · Nov 17, 2013

So that's 2760 bankruptcies total over 3 years over a total of 32100 companies.
2760/32100 = 0,08598

I'm not sure if I should use Var=(P(1-P)/n) or Var=nP(1-P)
Mind explaining why/when to use one or the other?

I'll use np(1-p) = 32100*0,08598(1-0,08598)=2522,6567 is the variance for total bankruptcies over 3 years.

bearing in mind that var is the expected value of the square of the difference from the mean, what var do you get when you scale that back down to a single year (one third of the population)?

Var=(X-μ)², hmm.. I'm really lost. I feel like i can't grasp this assignment.

Edit: I'm going to start over.. So the estimate for P_C i got right (0,02866) .. And the estimate for the expected number bankruptcies during 2014.. Isn't that given in the text itself?
"The number of business bankruptcies in the C-sector during 2014, X_C"
E(X)=920

haruspex · Nov 17, 2013

anonymousk said:

So that's 2760 bankruptcies total over 3 years over a total of 32100 companies.
2760/32100 = 0,08598

No, the population is now 3 * 32100. As I said, you have to pretend the companies are independent from year to year.

I'll use np(1-p) = 32100*0,08598(1-0,08598)=2522,6567 is the variance for total bankruptcies over 3 years.

No the prob should be the same as before, but the average (now being a total over 3 years) should be three times larger, and so should the var.
Call the var of the 3 year total V₃. If the actual annual average is μ and the estimated average is ##\hat{\mu}## then ##E((3\mu-3\hat\mu)^2) = V_3##. So what is ##E((\mu-\hat\mu)^2) ##?

anonymousk · Nov 17, 2013

haruspex said:

No, the population is now 3 * 32100. As I said, you have to pretend the companies are independent from year to year.

No the prob should be the same as before, but the average (now being a total over 3 years) should be three times larger, and so should the var.
Call the var of the 3 year total V₃. If the actual annual average is μ and the estimated average is ##\hat{\mu}## then ##E((3\mu-3\hat\mu)^2) = V_3##. So what is ##E((\mu-\hat\mu)^2) ##?

I see.
Annual average=920
If P_C was as an average over three years of data, is P_C then my ##\hat\mu##?

I can't seem to understand how to define my ##\hat\mu##.

E((920-0,02866)²) and E((920-2760)²) both give numbers that seem very wrong, hah. I'm waiting to get the "A-HAH" moment here.

anonymousk · Nov 18, 2013

Okay, I've now slept on it.

Specify an estimate for P_C.
Specify an estimate for the expected number of bankruptcies in the C-sector during 2014.

Is the estimate the same in both of these?
Estimate for P_C being (920*3)/(32100*3) = 0,02866
And estimate for the expected number of bankruptcies during 2014 being 920/32100= 0,02866, but it's the variance where they differ.

The variance for P_C is 96300*0,02866(1-0,02866) = 2680,86
The variance for the expected value is 32100*0,02866(1-0,02866) = 893,62

haruspex said:

Call the var of the 3 year total V₃. If the actual annual average is μ and the estimated average is ##\hat{\mu}## then ##E((3\mu-3\hat\mu)^2) = V_3##. So what is ##E((\mu-\hat\mu)^2) ##?

##E((\mu-\hat\mu)^2) ## is the variance for the estimate of the expected number bankruptcies during 2014. But I'm unable to continue from there.

haruspex · Nov 18, 2013

anonymousk said:

Is the estimate the same in both of these?

There are Y years with N businesses in each year. Because we are treating the years as independent, the Y years of data are just like one year of data with Y*N businesses. Y*B bankruptcies are observed. (I write it that way because we are told B, not Y*B.)
Let the actual prob of a given business going bankrupt in a given year be p. Our estimate of the probability is ##\hat p = (Y B)/(Y N) = B/N##. (Writing a hat over an unknown parameter is a standard way of representing our estimate of the parameter.)
The observed number for the Y years is Y*B, so this is ##\hat \mu_Y##, our estimate of the average number of bankruptcies in Y years.
Correspondingly, our estimate for the number in one year is ##\hat \mu = \hat \mu_Y/Y = B##.
We know that the variance in ##\hat \mu_Y## (the mean square error) is ##V_Y = E((\hat \mu_Y - \mu_Y)^2) = Y N p(1-p)##, and we estimate that as ##Y N \hat p(1-\hat p)##.
So now we want to know ##V = E((\hat \mu - \mu)^2) ##.
What is the relationship between ##\mu## and ##\mu_Y##? What do you get for V by combining those equations?

anonymousk · Nov 20, 2013

Been looking at this for 2 days now, and simply cannot get what you're trying to lure me towards here (combining those equations.)

However, I looked into Standard Error of the Mean

What I'm sitting with at the moment is

Variance of the estimate P_C:
σ/√n = 29,9/√3 = 17,26 = std dev
σ²=17,26²= 297,91.

And the variance for the expected number bankruptcies:
0,00093/√3 = 0,00054
σ²=0,00054² = 0,00000029

I put N=3 as sample size since it's the average of 3 years, (2009 2010 2011).

If it's correct I have no idea.

haruspex · Nov 20, 2013

anonymousk said:

Been looking at this for 2 days now, and simply cannot get what you're trying to lure me towards here (combining those equations.)

However, I looked into Standard Error of the Mean

What I'm sitting with at the moment is

Variance of the estimate P_C:
σ/√n = 29,9/√3 = 17,26 = std dev
σ²=17,26²= 297,91.

And the variance for the expected number bankruptcies:
0,00093/√3 = 0,00054
σ²=0,00054² = 0,00000029

I put N=3 as sample size since it's the average of 3 years, (2009 2010 2011).

If it's correct I have no idea.

Yes, you have to divide by √3. I was trying to avoid appeal to the authority of such formulae because I felt it would not give you any further insight into what happens. So FWIW I'll explain the path I was on:
##\hat p = (Y B)/(Y N) = B/N##.
Estimate of the average number of bankruptcies in Y years, Y*B = ##\hat \mu_Y##.
Estimate for the number in one year is ##\hat \mu = \hat \mu_Y/Y = B##.
##V_Y = E((\hat \mu_Y - \mu_Y)^2) = Y N p(1-p)##
We estimate ##V_Y## as ##Y N \hat p(1-\hat p)##.
We want to know ##V = E((\hat \mu - \mu)^2) ##.
##\mu = \mu_Y/Y##, ##\hat\mu = \hat\mu_Y/Y##
So
##V = E((\hat \mu_Y/Y - \mu_Y/Y)^2) = E((\hat \mu_Y - \mu_Y)^2)/Y^2 = Y N p(1-p)/Y^2 = N p(1-p)/Y##
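
Putting the assignment's figures into that chain gives the following, as a rough sketch in which the unknown p is replaced by the estimate ##\hat p## (an approximation, not an exact value). The variable names are chosen here for illustration.

```python
# Figures from the assignment: N businesses per year, an average of
# B bankruptcies per year, Y years of data.
N, B, Y = 32100, 920, 3

p_hat = (Y * B) / (Y * N)            # estimated bankruptcy probability, B/N
V_Y = Y * N * p_hat * (1 - p_hat)    # estimated variance of the Y-year total
V = V_Y / Y**2                       # variance of the one-year average, N*p*(1-p)/Y

print(round(V_Y, 2), round(V, 2), round(V ** 0.5, 2))
```

The square root of V comes out at roughly 17,26, which agrees with the 29,9/√3 figure from the standard-error route.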

But you also seem to be a bit confused between number of bankruptcies and the probability of a bankruptcy.
In the OP, P_C is the probability of a bankruptcy (for a given business in a given year). I'm just going to write that as p for convenience. To recap:
- p is the actual probability of an individual bankruptcy
- if we observe B*Y bankruptcies over N*Y business-years then our estimate for p is ##\hat p = B/N##
- ##\hat p## will not in general be equal to p; it is only our estimate for it based on the observations. The mean square error (variance) in the estimate is p(1-p)/(number of observations) = p(1-p)/(NY).
- That doesn't quite get us there because we don't know p, we only know ##\hat p##. It turns out that an unbiased estimate of the variance is ##\hat p(1-\hat p)/(NY-1)##, but given that NY is quite large here the -1 won't make much difference.
- Having estimated p, we can calculate an expected number of bankruptcies, En, in some other year. ##E_n = N\hat p##.
- It remains to find the variance of En. If we knew p we could write it down as Np(1-p). But we know ##\hat p##, not p. So what do you think the variance of En is?
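
The recap can be checked with the numbers from the assignment. This is only a sketch: the variable names are invented here, and p is replaced by ##\hat p## wherever it appears.

```python
N, B, Y = 32100, 920, 3

n_obs = N * Y                     # business-years observed (years treated as independent)
p_hat = (B * Y) / n_obs           # estimate of p, equal to B/N

# Two estimates of the variance of p_hat:
var_plugin = p_hat * (1 - p_hat) / n_obs          # plug p_hat into p(1-p)/(NY)
var_unbiased = p_hat * (1 - p_hat) / (n_obs - 1)  # the NY-1 version

E_n = N * p_hat                   # expected number of bankruptcies in another year

print(p_hat, var_plugin, var_unbiased, E_n)
```

With 96300 business-years the two variance estimates differ by about one part in 100000, which is why the -1 can be ignored here.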

anonymousk · Nov 21, 2013

Yes, you have to divide by √3. I was trying to avoid appeal to the authority of such formulae because I felt it would not give you any further insight into what happens.

I appreciate that. It's quite frustrating using these formulas without knowing what's happening behind it all. Still struggling to understand all this, but I feel like I'm getting there (slowly).

##\hat p = (Y B)/(Y N) = B/N##.
Estimate of the average number of bankruptcies in Y years, Y*B = ##\hat \mu_Y##.
Estimate for the number in one year is ##\hat \mu = \hat \mu_Y/Y = B##.
##V_Y = E((\hat \mu_Y - \mu_Y)^2) = Y N p(1-p)##
We estimate ##V_Y## as ##Y N \hat p(1-\hat p)##.
We want to know ##V = E((\hat \mu - \mu)^2) ##.
##\mu = \mu_Y/Y##, ##\hat\mu = \hat\mu_Y/Y##
So
##V = E((\hat \mu_Y/Y - \mu_Y/Y)^2) = E((\hat \mu_Y - \mu_Y)^2)/Y^2 = Y N p(1-p)/Y^2 = N p(1-p)/Y##

Just to be clear: Y in this case will be the 3 years the data is calculated from? (2009, 2010 and 2011). So it can be a number from 1-3.

But you also seem to be a bit confused between number of bankruptcies and the probability of a bankruptcy.
In the OP, P_C is the probability of a bankruptcy (for a given business in a given year). I'm just going to write that as p for convenience. To recap:
- p is the actual probability of an individual bankruptcy
- if we observe B*Y bankruptcies over N*Y business-years then our estimate for p is ##\hat p = B/N##
- ##\hat p## will not in general be equal to p; it is only our estimate for it based on the observations. The mean square error (variance) in the estimate is p(1-p)/(number of observations) = p(1-p)/(NY).
- That doesn't quite get us there because we don't know p, we only know ##\hat p##. It turns out that an unbiased estimate of the variance is ##\hat p(1-\hat p)/(NY-1)##, but given that NY is quite large here the -1 won't make much difference.
- Having estimated p, we can calculate an expected number of bankruptcies, En, in some other year. ##E_n = N\hat p##.

Just to be clear/sure again: The assignment only asks for the estimate of P_C, right? The P-hat = b/n. Finding the real probability of P is just extra you did over what is originally asked for?
And same with "an estimate for the expected number of bankruptcies in the C-sector during 2014.". The estimate number in 2014 would be 920.

- It remains to find the variance of En. If we knew p we could write it down as Np(1-p). But we know ##\hat p##, not p. So what do you think the variance of En is?

Since ##\hat p## is all we know, my best guess is N##\hat p##(1-##\hat p##), but not sure where to go from there.

haruspex · Nov 21, 2013

anonymousk said:

Just to be clear: Y in this case will be the 3 years the data is calculated from? (2009, 2010 and 2011). So it can be a number from 1-3.

You are only given the aggregate of the data for three years, so Y=3, not any number from 1 to 3.

Just to be clear/sure again: The assignment only asks for the estimate of P_C, right? The P-hat = b/n. Finding the real probability of P is just extra you did over what is originally asked for?

There is no way to find the 'real' probability (if there can be said to be such a thing). You can express various things in terms of this theoretical p, but we can't use that to find the value of anything. We have to rely on ##\hat p##.

And same with "an estimate for the expected number of bankruptcies in the C-sector during 2014.". The estimate number in 2014 would be 920.

Yes.

Since ##\hat p## is all we know, my best guess is N##\hat p##(1-##\hat p##), but not sure where to go from there.

A confession: I left that hanging as a question to play for time. I needed to think a bit more myself.
There are two sources of uncertainty (variance) in what the actual number will be in 2014:
- The uncertainty arising from our uncertainty in how well the value ##\hat p## approximates p.
- Even if we knew p, random variation comes into play.
Your expression N##\hat p##(1-##\hat p##) correctly gives (an estimate for) the second of those. I believe that the two sources of error are independent, so we can simply add the variances they give rise to. What do you think the variance is in the prediction for 2014 that arises from uncertainty in ##\hat p##?

anonymousk · Nov 23, 2013

Hmm.. I don't know. I don't understand why the variance for P_C is a higher number than the variance for the expected number bankruptcies during 2014 either.

Edit:

I'm also unsure of the numbers I calculated now.

The two variances I found

297,91 for P_C and the variance for the expected number bankruptcies 0,00000029. Shouldn't it be the other way around??

And I found 0,00000029 by variance of mean/n². Maybe I've been sitting with this assignment for too long and got really confused by it all, but was this the right way to calculate the variance for the expected number of bankruptcies during 2014?

haruspex · Nov 23, 2013

anonymousk said:

297,91 for P_C and the variance for the expected number bankruptcies 0,00000029. Shouldn't it be the other way around??

As I wrote, you seemed to be getting confused between the probability and the expected number of bankruptcies.
The estimate for p (i.e. P_C) is ##\hat p = 920/32100 = 0,029##.
The variance in ##\hat p## is p(1-p)/(NY) = p(1-p)/(3*32100), which we estimate as ##\hat p(1-\hat p)/(3*32100)##. See my 'recap' post.
If you are still getting different results for those by your methods, please post your current working and I will endeavour to spot where you are going wrong.

If we knew p exactly, the variance for the expected number bankruptcies in 2014 would be Np(1-p), which we would estimate as ##N\hat p(1-\hat p)##. But we don't know p exactly, so we have to add another variance component to account for that. I have left it as an exercise for you to suggest what that extra variance should be.

anonymousk · Nov 23, 2013

Specify an estimate for P_C.

[itex]\widehat{P}[/itex] I've got right. 0,02866

Specify an estimate for the expected number of bankruptcies in the C-sector during 2014.

Estimate for the expected number bankruptcies is the 3 year average, [itex]\overline{X}[/itex]=920

Calculate the variance of the estimator of P_C.

32100*0,02866(1-0,02866) = 893,6192/3 = 297,87

Calculate the variance of the estimator of the expected number of bankruptcies in the C-sector during 2014.

This is where I'm confused.

I divided 893,6192 by 32100² = 0,00000087/3 = 0,00000029 (which is the variance of the expected number bankruptcies for 2014)

But we don't know p exactly, so we have to add another variance component to account for that. I have left it as an exercise for you to suggest what that extra variance should be.

Hmm, I think what you are referring to is dividing by 3, because of the interval we get our numbers from.

haruspex · Nov 23, 2013

anonymousk said:

Calculate the variance of the estimator of P_C.

32100*0,02866(1-0,02866) = 893,6192/3 = 297,87

No. I don't understand why you keep trying to apply that formula to get the variance in the estimator of the probability. That formula, Np(1-p), gives the variance in the observed number of bankruptcies if there are N in the sample and the probability is p. I gave you the correct formula in the previous post and elsewhere.

Calculate the variance of the estimator of the expected number of bankruptcies in the C-sector during 2014.

I divided 893,6192 by 32100² = 0,00000087/3 = 0,00000029 (which is the variance of the expected number bankruptcies for 2014)

That's obviously much too low. I don't understand that calculation. This is where you use the Np(1-p) formula. As I posted earlier today:

If we knew p exactly, the variance for the expected number bankruptcies in 2014 would be Np(1-p), which we would estimate as ##N\hat p(1-\hat p)##.

But we don't know p exactly, so we have to add another variance component to account for that. I have left it as an exercise for you to suggest what that extra variance should be.

Hmm, I think what you are referring to is dividing by 3, because of the interval we get our numbers from.

We have computed the variance in our estimator for p. If we knew p exactly we would estimate the number of bankruptcies in 2014 as Np. If our value for p is wrong by an amount x, how much would that displace our estimate for the number of bankruptcies in 2014?

anonymousk · Nov 23, 2013

No. I don't understand why you keep trying to apply that formula to get the variance in the estimator of the probability. That formula, Np(1-p), gives the variance in the observed number of bankruptcies if there are N in the sample and the probability is p. I gave you the correct formula in the previous post and elsewhere.

0,02866(1-0,02866)/(3*32100) = 0,000000289
This is the variance of the estimator of P. Again a really low number. Thinking of giving up on this assignment, because it simply doesn't catch.

That's obviously much too low. I don't understand that calculation. This is where you use the Np(1-p) formula. As I posted earlier today:

32100*0,02866(1-0,02866)/3 = 297.. This number makes sense .. so the standard deviation from the mean for a single year would be 17,25.

We have computed the variance in our estimator for p. If we knew p exactly we would estimate the number of bankruptcies in 2014 as Np. If our value for p is wrong by an amount x, how much would that displace our estimate for the number of bankruptcies in 2014?

Hmm I am thinking it would displace our number by 'variance of estimator for p'.

haruspex · Nov 24, 2013

anonymousk said:

0,02866(1-0,02866)/(3*32100) = 0,000000289
This is the variance of the estimator of P. Again a really low number. Thinking of giving up on this assignment, because it simply doesn't catch.

It may be low, but hey, it's based on 96300 datapoints, so it should be low. And remember this is the variance, which is the square of the standard deviation. As s.d. it's 0,00053, which is somewhat less than p, but not hugely less. That's as you'd expect.

32100*0,02866(1-0,02866)/3 = 297.. This number makes sense .. so the standard deviation from the mean for a single year would be 17,25.

Hmm I am thinking it would displace our number by 'variance of estimator for p'.

Not quite. Our estimate for number in 2014 will be ##N\hat p##. If we had known p exactly we would have estimated ##N p##. The difference is ##N(\hat p - p)##. The variance that results is therefore ##E((N(\hat p - p))^2) = N^2E((\hat p - p)^2) = N^2 Var(\hat p) = N^2 p(1-p)/(NY) = (N/Y)p(1-p)##. We already had a variance Np(1-p) even if we knew p. Add them.
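
For concreteness, here is a small sketch of the two terms with the assignment's figures. It assumes p can be replaced by ##\hat p## in both formulas, so the results are estimates; the variable names are invented for this example.

```python
N, B, Y = 32100, 920, 3
p_hat = B / N

# Random fluctuation in a single year, present even if p were known exactly.
var_random = N * p_hat * (1 - p_hat)

# Extra variance because p_hat is only an estimate of p:
# N^2 * Var(p_hat) = (N / Y) * p(1 - p)
var_p_hat = p_hat * (1 - p_hat) / (N * Y)
var_from_p = N**2 * var_p_hat

var_total = var_random + var_from_p

print(round(var_random, 2), round(var_from_p, 2), round(var_total, 2))
```

The first term is close to 893,6 and the second close to 297,9, so the sum is close to 1191,5.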

anonymousk · Nov 24, 2013

(N/Y)p(1−p) = (32100/3)*0,02866(1-0,02866) = 297,87.

These are the same results as before, so not sure where I went wrong.

32100*0,02866(1-0,02866)/3 = 297

I think I need to sleep on it, since it's 5am here, hehe. I appreciate your help and patience with me! :)

haruspex · Nov 24, 2013

anonymousk said:

(N/Y)p(1−p) = (32100/3)*0,02866(1-0,02866) = 297,87.

These are the same results as before, so not sure where I went wrong.

Sorry, I missed an error in your post #20.
You wrote:

32100*0,02866(1-0,02866)/3 = 297.. This number makes sense .. so the standard deviation from the mean for a single year would be 17,25.

This was in response to my advice "This is where you use the Np(1-p) formula." But note I did not write any "/Y" in there, so you should not have divided by 3.

You should have these two contributions to the total variance in our estimate of the 2014 bankruptcies:
- a component due to the uncertainty in ##\hat p##. In my preceding post I showed this is Np(1-p)/Y.
- a component from random fluctuations, even if we knew p exactly. This is Np(1-p) (not Np(1-p)/Y; if we know p exactly it cannot matter how many years of data we collected.)
Since these two sources are independent, we can add the variances.

anonymousk · Nov 24, 2013

haruspex said:

Sorry, I missed an error in your post #20.
You wrote:

This was in response to my advice "This is where you use the Np(1-p) formula." But note I did not write any "/Y" in there, so you should not have divided by 3.

Okay, this leaves me with 32100*0,02866(1-0,02866) = 893,6192

haruspex said:

You should have these two contributions to the total variance in our estimate of the 2014 bankruptcies:
- a component due to the uncertainty in ##\hat p##. In my preceding post I showed this is Np(1-p)/Y.
- a component from random fluctuations, even if we knew p exactly. This is Np(1-p) (not Np(1-p)/Y; if we know p exactly it cannot matter how many years of data we collected.)
Since these two sources are independent, we can add the variances.

Sorry, but I don't feel like I'm getting any closer anymore at this point, even with these good hints you throw at me, since I've been sitting with this assignment for a whole week and still not finished it.

I just can't see how I can add the "two contributions" to find the variance of number bankruptcies of 2014.

One thing that confused me was

Yes, you have to divide by √3. I was trying to avoid appeal to the authority of such formulae because I felt it would not give you any further insight into what happens. So FWIW I'll explain the path I was on:

Thought I would be using Sampling Theorem: "the standard deviation of the distribution of sample means is the population standard deviation divided by √sample size.", where our sample size was 3.

haruspex · Nov 24, 2013

anonymousk said:

Okay, this leaves me with 32100*0,02866(1-0,02866) = 893,6192

Yes.

I just can't see how I can add the "two contributions" to find the variance of number bankruptcies of 2014.

If two random variables are independent then var(A+B) = var(A) + var(B). The uncertainty in estimate of p is independent of the inherent variation in actual bankruptcies year to year.
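
The additivity can be verified exactly on a toy case. The sketch below uses small hypothetical values (N = 20, Y = 3, p = 0.3, none of them from the assignment) so that every combination of outcomes can be enumerated. It computes the mean squared difference between next year's count and the prediction ##N\hat p##, then compares it with Np(1-p) + Np(1-p)/Y.

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical small values so the double sum stays cheap.
N, Y, p = 20, 3, 0.3

# a = next year's count, Binomial(N, p)
# s = total over the Y past years, Binomial(N*Y, p), independent of a
# prediction for next year: N * p_hat, with p_hat = s / (N*Y)
mse = 0.0
for a in range(N + 1):
    prob_a = binom_pmf(N, a, p)
    for s in range(N * Y + 1):
        error = a - N * s / (N * Y)
        mse += prob_a * binom_pmf(N * Y, s, p) * error**2

expected = N * p * (1 - p) + N * p * (1 - p) / Y

print(mse, expected)
```

Both numbers agree up to floating-point rounding, which is the sum-of-variances rule at work.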

Thought I would be using Sampling Theorem: "the standard deviation of the distribution of sample means is the population standard deviation divided by √sample size.", where our sample size was 3.

Yes, but what do you think the 'population' is in this context? I thought it might be rather non-obvious - but that probably makes it a good discussion to have.

anonymousk · Nov 24, 2013

I'm so confused about this assignment. In my defense, I'm not this bad at the other courses (micro economics etc). It's just statistics I can't seem to wrap my mind around. I still haven't done the rest of the assignment, c), d) and e), but I think I've got the confidence intervals under control. It's the interval (2009-2011) that throws me off course I think, hehe.

So variance for the estimate of P was 0,000000289. So far so good, hehe.

Yes.

So the variance for 2014 is 893,6192?

Yes, but what do you think the 'population' is in this context? I thought it might be rather non-obvious - but that probably makes it a good discussion to have.

Aah.. population is the 96300 datapoints. But I thought that the calculated variance for 2014, 893,6192, had to be divided by the sample size, since the average, 920, was estimated from an average of 3 year interval. Thus dividing by 3 and getting a more concentrated variance.

Edit:

Specify the parameters of the approximate normal distribution.

So a normal distribution has two parameters, the mean and the standard deviation.
The mean in this case is the observed 920 bankruptcies. The standard deviation however. Do I use the standard deviation for the expected number bankruptcies, (√893,6192), or do I use them both somehow? The variance for the estimate as well the variance for expected bankruptcies in 2014?

anonymousk · Nov 24, 2013

Ohhhh, I got it..

So a 95% confidence interval tells us, that we are 95% confident that the true population proportion is between these 2 numbers (2.76% to 2.97%). It's the range/uncertainty of our estimator [itex]\widehat{P}[/itex]=2,86%
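
That interval can be reproduced in a few lines. This is a sketch under the assumptions used in the thread: the three years are treated as 96300 independent business-years, and the normal approximation is used for ##\hat p##. The variable names are made up here.

```python
from statistics import NormalDist

N, B, Y = 32100, 920, 3
p_hat = B / N

# Standard error of p_hat, using p_hat in place of the unknown p.
se = (p_hat * (1 - p_hat) / (N * Y)) ** 0.5

# Two-sided 95% interval: z is the 97.5% quantile of the standard normal.
z = NormalDist().inv_cdf(0.975)
ci_low = p_hat - z * se
ci_high = p_hat + z * se

print(f"{ci_low:.4f} to {ci_high:.4f}")
```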

Just unsure how to calculate the CI for the expected number of bankruptcies, since I believe I still haven't gotten the variance of it right.. unless it was the 32100*0,02866(1-0,02866) = 893,6192, but shouldn't I be using the Sampling Theorem here as well?

I get a CI of [861,41 ; 978,59] if the variance of our expected number bankruptcies in 2014 is 893,62. Seems to be a quite big interval.

haruspex · Nov 24, 2013

anonymousk said:

So the variance for 2014 is 893,6192?

No, that's just the Np(1-p), which is the variance it would have if by some magic we knew p exactly. We have to add the variance that comes from our uncertainty regarding p. If you scan back through the posts you'll see I gave you the formula for this. It doesn't change the variance hugely, but it is significant.

Aah.. population is the 96300 datapoints.

No. For the purposes of applying that theorem, your random variable is the number of bankruptcies in a year. You had three sample years, so the sample population size is 3. Hence the 1/√3.

But I thought that the calculated variance for 2014, 893,6192, had to be divided by the sample size, since the average, 920, was estimated from an average of 3 year interval. Thus dividing by 3 and getting a more concentrated variance.

That's the other component. You need to get your head around the fact that there are two separate causes of variance for what will actually happen in a future year:
- Even if we knew p exactly, random fluctuation will give a variance of Np(1-p). There is no divide by 3 here since we know p, so it doesn't matter how many years of data we have that might be a basis for an estimate of p.
- Because we don't know p exactly, there's another component. This one does depend on the number of years of data because the more years we have the less uncertainty in p.
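
A minimal sketch of how the two components could combine. It assumes (as a reading of this thread, not a quoted result) that the second component takes the (N/Y)p(1-p) form mentioned elsewhere in the discussion, and that the two sources are independent so their variances add. The variable names are illustrative.

```python
N = 32100     # companies in the C-sector
Y = 3         # years of data behind the estimate of p
p = 920 / N   # estimated bankruptcy probability

# Component 1: random year-to-year fluctuation if p were known exactly.
# No division by Y here.
var_fluctuation = N * p * (1 - p)

# Component 2: extra variance from having to estimate p from Y years.
# Assumed form: (N / Y) * p * (1 - p); it shrinks as Y grows.
var_estimation = (N / Y) * p * (1 - p)

# Independent sources of variation, so the variances add
var_total = var_fluctuation + var_estimation

print(f"fluctuation component: {var_fluctuation:.2f}")
print(f"estimation component:  {var_estimation:.2f}")
print(f"total variance:        {var_total:.2f} (sd {var_total ** 0.5:.2f})")
```

Only the second component depends on Y, which matches the two bullets above: more years of data reduce the uncertainty in p but not the year-to-year randomness.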

So a normal distribution has two parameters, the mean and the standard deviation.
The mean in this case is the observed 920 bankruptcies. The standard deviation, however, I'm unsure about. Do I use the standard deviation for the expected number of bankruptcies (√893,6192), or do I use them both somehow?

We're not taking it to be a normal distribution, but let that pass.
The mean of numbers of bankruptcies in prior years is our best estimate for the number in the coming year. The standard deviation (the square root of the variance) is a guide to how confident we are in our estimate.

The variance for the estimate as well as the variance for the expected bankruptcies in 2014?

You've lost me. What's the difference between the estimate for 2014 and the expected bankruptcies for 2014?

anonymousk · Nov 24, 2013

haruspex said:

No, that's just the Np(1-p), which is the variance it would have if by some magic we knew p exactly. We have to add the variance that comes from our uncertainty regarding p. If you scan back through the posts you'll see I gave you the formula for this. It doesn't change the variance hugely, but it is significant.

I believe you're referring to this one ##(N/Y)p(1-p)##

And I did (N/Y)p(1−p) = (32100/3)*0,02866(1-0,02866) = 297,87. That's how I understand the formula, which was wrong, hmm.

We're not taking it to be a normal distribution, but let that pass.
The mean of numbers of bankruptcies in prior years is our best estimate for the number in the coming year. The standard deviation (the square root of the variance) is a guide to how confident we are in our estimate.

Question c) goes
Explain that the number of bankruptcies in the C-sector during 2014 can be assumed to be approximately normally distributed.
Specify the parameters of the approximate normal distribution.

I read that the normal approximation to the binomial is acceptable if nP(1-P) > 5 (which it is in this case). It's specifying the parameters of the approximate normal distribution that I was confused about.
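
To make the rule of thumb concrete, here is a rough Python sketch (standard library only) that checks nP(1-P) > 5 and compares the exact binomial probabilities with the matching normal density at a few counts. It assumes the approximating normal simply takes the binomial mean and variance with the estimate 920/32100 plugged in for P; which variance the assignment really wants is still open in this thread. The helper functions are written here for illustration.

```python
import math

N = 32100
p = 920 / N

mean = N * p              # mean of the binomial
var = N * p * (1 - p)     # variance of the binomial, p treated as known
sd = math.sqrt(var)

# Rule of thumb quoted above: n*P*(1-P) > 5
print("n*P*(1-P) =", round(var, 2), "-> rule satisfied:", var > 5)

def binom_pmf(k, n, prob):
    """Exact binomial pmf, computed via log-gamma to avoid huge factorials."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(prob) + (n - k) * math.log(1 - prob))
    return math.exp(log_pmf)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Compare the two at a few counts around the mean
for k in (860, 890, 920, 950, 980):
    print(k, f"binomial {binom_pmf(k, N, p):.6f}", f"normal {normal_pdf(k, mean, sd):.6f}")
```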

You've lost me. What's the difference between the estimate for 2014 and the expected bankruptcies for 2014?

I meant the variance for [itex]\widehat{P}[/itex] and the variance for the expected bankruptcies.

What I meant was whether we should add the two variances to get the variance for the expected number in 2014.
32100*0,02866(1-0,02866) = 893,6192
0,02866(1-0,02866)/(3*32100) = 0,000000289

893,6192+0,000000289

haruspex · Nov 24, 2013

anonymousk said:

I believe you're referring to this one ##(N/Y)p(1-p)##

And I did (N/Y)p(1−p) = (32100/3)*0,02866(1-0,02866) = 297,87. That's how I understand the formula, which was wrong, hmm.

Wrong? Np(1-p) and (N/Y) p(1-p) are both valid for the things they're valid for.

Question c) goes
Explain that the number of bankruptcies in the C-sector during 2014 can be assumed to be approximately normally distributed.
Specify the parameters of the approximate normal distribution.

I read that the normal approximation to the binomial is acceptable if nP(1-P) > 5 (which it is in this case). It's specifying the parameters of the approximate normal distribution that I was confused about.

OK, I missed that context.

I meant the variance for [itex]\widehat{P}[/itex] and the variance for the expected bankruptcies.

What I meant was whether we should add the two variances to get the variance for the expected number in 2014.
32100*0,02866(1-0,02866) = 893,6192
0,02866(1-0,02866)/(3*32100) = 0,000000289

No, we're not adding a variance for a number of bankruptcies to a variance for a probability! That would make no sense. We want to add two variance components for the number of bankruptcies, but one is a consequence of a variance for a probability. So it isn't p(1-p)/N or p(1-p)/(NY). It will have the same order of magnitude as N. Neither is in itself the variance of the expected number in 2014. You have to add them to get that.
Sorry, but I don't feel you're reading my posts very carefully. I keep having to make the same points. I understand you're a bit fed up with this assignment. Would you prefer to do something else for a while and go back through the thread when you've more time? I'll still be here.
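
For readers following the units argument, a small sketch of why a variance for a probability has to be rescaled before it can be added to a variance for a count. It uses the standard rule Var(cX) = c²·Var(X) with c = N; reading the "same order of magnitude as N" component as N²·Var(p̂) is an interpretation of the post above, not a quoted formula.

```python
N = 32100
Y = 3
p = 920 / N

# Variance of the estimated probability: tiny, and in units of probability squared
var_p_hat = p * (1 - p) / (N * Y)

# The same uncertainty expressed as a count: multiply the estimate by N,
# so the variance is multiplied by N squared
var_count_from_p = N ** 2 * var_p_hat

print(f"variance of p_hat:               {var_p_hat:.3e}")
print(f"rescaled to a count (N^2 * var): {var_count_from_p:.2f}")
print(f"(N / Y) * p * (1 - p):           {(N / Y) * p * (1 - p):.2f}")
```

The last two printed values agree, which is why a term of the form (N/Y)p(1-p) can be added to Np(1-p), while p(1-p)/(NY) cannot.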
