# Derivation of the binomial distribution from Bernoulli trials

## Homework Statement

Derive the binomial distribution for n independent Bernoulli trials.

## The Attempt at a Solution

Each Bernoulli trial has only two possible outcomes.
Let’s name one outcome success and the other failure.
Let’s denote the probabilities of success and failure in each Bernoulli trial by p and q respectively.

Clearly, q = 1-p

For n independent Bernoulli trials, let’s denote the probability of getting k successes by P(k, n).

The probability of getting a success in each trial is p.
So, the probability of getting k successes in k trials is ## p^k##.

The probability of getting k successes in n independent Bernoulli trials equals the probability of getting k successes and n-k failures in those n trials.

The probabilities of getting k successes in k independent Bernoulli trials and of getting n-k failures in n-k independent Bernoulli trials are ## p^k## and ##(1-p)^{n-k}## respectively.

Let’s consider the following events.

Event A: getting k successes in the first k trials of the n independent Bernoulli trials

Event B: getting n-k failures in the remaining n-k trials of the n independent Bernoulli trials

Now, the probability that both events A and B occur is ## p^k (1-p)^{n-k} ##

But, according to the problem the k successes could be in any k trials of the n independent Bernoulli trials. It is not necessary that these k trials should be the 1st k trials of the n independent Bernoulli trials.

So, the events corresponding to the problem are:

Event C: getting k successes in any k of the n independent Bernoulli trials

Event D: getting n-k failures in the remaining n-k trials of the n independent Bernoulli trials

Now, the probability of getting both events C and D is what the question is asking.

For event C,

In how many ways can I choose k trials out of the n independent trials?

This is ##\binom n k## i.e. choosing k boxes out of n boxes.

Having chosen which k of the n trials are to be the successes, the probability of getting k successes in these chosen trials is ## p^k##.

Once the k trials are chosen, the remaining n-k trials are determined; there is only one way to place the failures.

So, for each choice of the k trials, the probability of getting k successes in those trials and n-k failures in the remaining n-k trials is ## p^k (1-p)^{n-k} ##.

Since there are ##\binom n k## ways of choosing the k trials, and for each choice the probability of getting k successes in those trials and n-k failures in the rest is ## p^k (1-p)^{n-k} ##,
the probability of getting k successes in any k of the n independent Bernoulli trials and n-k failures in the rest is the sum of ## p^k (1-p)^{n-k} ## over all ##\binom n k## choices.

Hence, the probability of getting k successes out of n independent Bernoulli trials is ##P(k,n) = \binom n k p^k (1-p)^{n-k}##
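This formula can be checked numerically. Below is a minimal Python sketch (the helper name `binom_pmf` is mine, purely illustrative): by the binomial theorem with p + q = 1, the probabilities over all k must sum to 1.

```python
from math import comb

def binom_pmf(k, n, p):
    # P(k, n) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Summing the pmf over k = 0..n must give 1.
n, p = 10, 0.3
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
```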

Is this correct?

Ray Vickson
Science Advisor
Homework Helper
Dearly Missed

Is this correct?
Yes, it is correct.

Thanks.

Orodruin
Staff Emeritus
Science Advisor
Homework Helper
Gold Member
Essentially, albeit a bit convoluted.

Essentially, albeit a bit convoluted.
I didn't condense the steps because I used to forget the derivation. This time I worked it out as my intuition led me.

I do not understand <k>.
The book gives <k> = np.
Here k is the number of successes which we want in n independent Bernoulli trials.

So, does <k> mean the average number of successes we can get in the n independent Bernoulli trials?

Is it that, say, in the 1st trial we get 0 successes, in the 2nd trial we get 1 success, in the 3rd trial we get 0 or 1 success, and so on; i.e. we take the outcomes of the n trials, sum them, and divide by n? This would give the average number of successes in n trials.

<k> = ## \sum_{k=0}^{n} k\, P(k,n) ##, where P(k,n) = ##\binom n k p^k (1-p)^{n-k}##

But, then how does <k> = np?
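One way to convince yourself is simply to evaluate the sum numerically; a minimal sketch (`binom_pmf` is an illustrative helper, not from the book):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.4
# <k> = sum over k of k * P(k, n)
mean_k = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
# mean_k comes out equal to n * p = 8 (up to floating-point rounding)
```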


I thought of the Binomial distribution as the probability of getting k successes out of n independent Bernoulli trials.

I do not understand how to get the following statement from the picture above:
The Binomial distribution is the sum of n independent Bernoulli trials.

Let’s associate a random variable with each trial. We have n random variables ## X_1, X_2, …, X_n ##.
Now, these variables are outcome of each corresponding trial.
When the outcome is success, the variable has the value 1 and when the outcome is failure , the variable has the value 0.
Then, k could be written as the sum of these variables.

So, <k> = ## < X_1 + X_2+…+X_n> ##
By linearity of expectation, ## < X_1 + X_2+…+X_n> = < X_1> +<X_2> +… + <X_n>## (independence is not even needed for this step)

Now, ## <X_i> = 1\cdot p + 0\cdot q = p## for each i.

So, <k> = np.

I derived this with the help of the following idea:
When the outcome is success, the variable has the value 1 and when the outcome is failure , the variable has the value 0.
But I do not know why I should invoke this idea.

Why is it necessary to know that <k> = np?
Does it have any physical significance? I mean is there anything here to understand or is this just a mathematical calculation?
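The picture of k as a sum of 0/1 random variables can be checked by simulation; a minimal sketch (all names are illustrative), where the observed average of k should land near np:

```python
import random

random.seed(0)
n, p, experiments = 50, 0.3, 20000

def one_experiment():
    # n Bernoulli variables X_i in {0, 1}; k is their sum.
    return sum(1 if random.random() < p else 0 for _ in range(n))

avg_k = sum(one_experiment() for _ in range(experiments)) / experiments
# avg_k should come out close to n * p = 15
```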

Summary of derivation of Binomial distribution
If p is the probability of a win, then p^k is the probability of winning k times in a row.
If you have n trials and win only k times, then you lose the rest (n-k) of the trials. So the probability of winning the first k and then losing the rest would be ##p^k(1-p)^{n-k}##.
Now comes the combination part.
As I said, the probability of winning the first k and losing the rest is that piece of the formula. But the probability would be the same if you lost the first n-k and won the last k.
In all, you have to add up all the possible ways to win k times out of n. Each way has the same probability, so the total probability of winning k times is the probability of one of the ways times the number of combinations of k wins and n-k losses.

Is there any way to visualize the standard deviation also? Or is it, too, just a mathematical calculation?
How does it matter whether the standard deviation for a given data is more or less?

I have not understood the following part. Will you please shed some light on it:

Why is it said that the standard deviation tells us the width of a distribution?

Orodruin
Staff Emeritus
Science Advisor
Homework Helper
Gold Member
So, does <k> mean the average number of successes we can get in the n independent Bernoulli trials?
It is the expected number of successful trials.

But, then how does <k> = np?
If you compute the sum, you will find ##np##. The easier way of seeing it is by linearity of expectation value. Each trial gives an expectation value of ##p## and you have ##n## trials.

Why is it necessary to know that <k> = np?
It is a property of the distribution, just like its variance. Whether it is necessary to know it or not depends on what information you want to extract from the distribution.

Is there any way to visualize the standard deviation also?
The standard deviation (or more conveniently, its square - the variance) is a measure of the spread in the distribution, i.e., how much the result will typically differ from the expectation value.
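As a concrete check, one can compute the variance straight from its definition and compare with the known binomial result ##np(1-p)## (a sketch; helper names are illustrative):

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 30, 0.5
mean = n * p
# Variance from the definition: (k - <k>)^2 weighted by P(k, n).
var = sum((k - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))
sigma = sqrt(var)
# For the binomial distribution this equals n * p * (1 - p) = 7.5.
```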

Ray Vickson
Science Advisor
Homework Helper
Dearly Missed
Is there any way to visualize the standard deviation also?

Why is it said that the standard deviation tells us the width of a distribution?

Because that is exactly what it does.

You can have two different distributions with the same mean but different standard deviations, and it is handy to have some (at least crude) way to speak about some of their differences in numerical terms. If one distribution is more "spread out" than another it has a higher standard deviation. The simplest example would be when comparing the two distributions Unif(-1,1) and Unif(-5,5). The first distribution describes outcomes that are uniformly distributed between -1 and +1, while the second between -5 and +5. If you draw a random sample from the first distribution your values would always lie sprinkled between -1 and +1, but a sample from the second distribution will have some outcomes near +5 or -5.
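The Unif(-1,1) versus Unif(-5,5) comparison can be made concrete with a small simulation (a sketch; names are illustrative). A uniform distribution on (-a, a) has standard deviation ##a/\sqrt{3}##, so the wider sample should show roughly five times the spread:

```python
import random

random.seed(1)
N = 100_000
sample_a = [random.uniform(-1, 1) for _ in range(N)]
sample_b = [random.uniform(-5, 5) for _ in range(N)]

def std(xs):
    # Population standard deviation of a sample.
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Unif(-a, a) has standard deviation a / sqrt(3): about 0.577 for a=1, 2.887 for a=5.
std_a, std_b = std(sample_a), std(sample_b)
```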

The fact that the mean increases linearly in ##n## but the standard deviation increases like ##\sqrt{n}## is what makes the world possible! Looking at averages ##\bar{X} =(1/n) \sum_1^n X_i##, if the individual random terms have mean ##\mu## and standard deviation ##\sigma##, the average has mean ##\mu## but standard deviation ##\sigma/\sqrt{n}##. For very, very large ##n## the average ##\bar{X}## is "almost non-random", very much like a deterministic quantity. That is why Physics works, and is why life is possible in the universe: the huge number of atomic particles undergoing their random motions look "organized" on the macro scale of everyday life. We don't see all the underlying randomness, and that is good because it allows events to happen predictably; it makes the cells in our bodies behave as they should and it makes it possible for the world to exist.
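The ##\sigma/\sqrt{n}## behaviour of averages can be seen in a small simulation (a sketch with illustrative names): the spread of the average of n fair-coin flips shrinks tenfold when n grows a hundredfold.

```python
import random

random.seed(2)

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def spread_of_average(n, experiments=2000):
    # Standard deviation of the average of n fair-coin flips,
    # estimated from many repeated experiments.
    avgs = [mean([random.random() < 0.5 for _ in range(n)]) for _ in range(experiments)]
    return std(avgs)

s10, s1000 = spread_of_average(10), spread_of_average(1000)
# Theory: sigma / sqrt(n) with sigma = 0.5, so about 0.158 vs 0.0158.
```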

Ray Vickson
Science Advisor
Homework Helper
Dearly Missed
The Binomial distribution is the sum of n independent Bernoulli trials.

Why is it necessary to know that <k> = np? Does it have any physical significance?

When you ask for the number of heads in 20 coin tosses, you are asking for the number of heads on toss 1 + the number of heads on toss 2 + ... + the number of heads on toss 20. For each toss that number is either 0 (if you get a tail) or 1 (if you get a head).

As to why it is necessary to know ##\langle k \rangle = np##: well, that just tells you the expected number of "successes". If I toss a pair of fair dice, the probability of getting a '7' on any toss is p = 1/6. If I toss a pair of dice 600 times I would expect to get a number of '7's near 600/6 = 100. In any actual experiment (consisting of tossing dice 600 times) the number of '7's will likely be different from 100 most of the time, but not all that different: sometimes higher, sometimes lower but hovering around 100. If you were asked to bet on the number of '7's in 600 tosses, the number 100 would be your best guess. Over the long run you would win the bet more often by picking 100 than by picking any other number. (Admittedly, you would not win very often but you would win even less often if you picked a number other than 100.)
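The dice example can be simulated directly (a minimal sketch; names are illustrative): individual runs of 600 tosses scatter around 100 sevens, and the long-run average hovers near 100.

```python
import random

random.seed(3)

def sevens_in_600_tosses():
    # Toss a pair of dice 600 times and count how often the total is 7.
    count = 0
    for _ in range(600):
        if random.randint(1, 6) + random.randint(1, 6) == 7:
            count += 1
    return count

results = [sevens_in_600_tosses() for _ in range(200)]
avg = sum(results) / len(results)
```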

Orodruin
Staff Emeritus
Science Advisor
Homework Helper
Gold Member
If you were asked to bet on the number of '7's in 600 tosses, the number 100 would be your best guess.
Just to add that the expectation value is not always the best guess. If I flip 101 coins, 50.5 is a (very) bad guess for the number of heads.
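This is easy to verify by computing the pmf for n = 101 (a sketch; `binom_pmf` is an illustrative helper): the expectation 50.5 is not an attainable outcome at all, and the most likely values are its integer neighbours 50 and 51.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 101, 0.5
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
# The distribution peaks at the two integers adjacent to n * p = 50.5.
modes = [k for k, v in enumerate(pmf) if v == max(pmf)]
```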

Just to add that the expectation value is not always the best guess. If I flip 101 coins, 50.5 is a (very) bad guess for the number of heads.

Do you mean that one has to use common sense, too, before declaring one's guess?
But, then the best guess should be close to the mean and physically possible. Right?
So, the best guess, here, would be 50 or 51. Right?

How good our guess is, is determined by the fractional width of the distribution, i.e. ##\frac{\sigma_k}{<k>}##. Right? The smaller the fractional width, the better the guess.
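For the binomial this fractional width can be written out explicitly: with <k> = np and ##\sigma_k = \sqrt{np(1-p)}##, the ratio is ##\sqrt{(1-p)/(np)}##, which shrinks like ##1/\sqrt{n}##. A minimal numeric illustration (names are mine):

```python
from math import sqrt

def fractional_width(n, p):
    # sigma_k / <k> = sqrt(n*p*(1-p)) / (n*p) = sqrt((1-p) / (n*p))
    return sqrt(n * p * (1 - p)) / (n * p)

w100 = fractional_width(100, 0.5)      # 0.1
w10000 = fractional_width(10000, 0.5)  # 0.01: 100x more trials, 10x sharper peak
```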

Orodruin
Staff Emeritus
Science Advisor
Homework Helper
Gold Member
Do you mean that one has to use common sense, too, before declaring one's guess?
But, then the best guess should be close to the mean and physically possible. Right?
So, the best guess, here, would be 50 or 51. Right?
You need to know the distribution. If the distribution is a sum of several independent variables (as in this case) the clt tells you it will be approximately Gaussian and therefore peaked near the expected value. However, for a general distribution this may not be the case.
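The Gaussian approximation can be checked numerically for a symmetric binomial (a sketch; helper names are illustrative): pointwise, the binomial pmf stays very close to the Gaussian density with the same mean and standard deviation.

```python
from math import comb, exp, pi, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def gauss(x, mu, sigma):
    # Gaussian density with mean mu and standard deviation sigma.
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

n, p = 200, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))
# Largest pointwise gap between the pmf and its Gaussian approximation.
max_err = max(abs(binom_pmf(k, n, p) - gauss(k, mu, sigma)) for k in range(n + 1))
```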

You need to know the distribution. If the distribution is a sum of several independent variables (as in this case) the clt tells you it will be approximately Gaussian and therefore peaked near the expected value. However, for a general distribution this may not be the case.
Do you mean that a distribution is a variable as you are saying the distribution is a sum of several independent variables?
What I meant by the distribution function is the probability of getting k, where k is the sum of several independent variables, and the distribution is the graph of the probability of getting k with respect to k. Is this correct?

What is clt here?
For any value of n, the probability that k equals the expected value is maximum. But, if the expected value is physically impossible as in the example you gave, then the best guess will be the value which is physically possible and closest to the mean value. Is this correct?

When you ask for the number of heads in 20 coin tosses, you are asking for the number of heads on toss 1 + the number of heads on toss 2 + ... + the number of heads on toss 20. For each toss that number is either 0 (if you get a tail) or 1 (if you get a head).
So, the variable ## X_i ## does not represent the outcome of the ith trial. It represents the number of successes in the ith trial.
In the following:
Let’s associate a random variable with each trial. We have n random variables ##X_1, X_2, …, X_n ##.
Now, these variables are outcome of each corresponding trial.
When the outcome is success, the variable has the value 1 and when the outcome is failure , the variable has the value 0.
Then, k could be written as the sum of these variables.

So, <k> = ## < X_1 + X_2+…+X_n>##
this statement, "Now, these variables are outcome of each corresponding trial," is wrong.