# Feynman's probability lecture -- a few questions

1. Jul 5, 2016

I am reading lecture #6 in the first volume of "Feynman's Lectures on Physics", and I understood the first half of the lecture quite well. However, after he proved that D(rms)= √D^2= √N, I started to lose him, and so I have quite a few questions:

1) "The variation of Nh from its expected value N/2 is Nh-N/2=D/2. The rms deviation is (Nh-N/2)rms=√N/2"
I understand that the expected value for the difference between the number of heads and tails (AKA the distance traveled) is √N, and that the expected value for the number of heads is N/2. But how he concluded from these 2 facts that the number of heads is expected to deviate from N/2 by √N/2 is what I do not understand. Furthermore, is it not a paradoxical conclusion that on one hand you expect the number of heads to be N/2, and on the other hand you expect the difference between the number of heads and tails to be √N?

2) "According to our result for Drms, we expect that the “typical” distance in 30 steps ought to be √30=5.5, or a typical k should be about 5.5/2=2.8 units from 15. We see that the “width” of the curve in Fig. 6–2, measured from the center, is just about 3 units, in agreement with this result."
By "typical" does he mean the average? And I don't understand the "width of the curve measured from the center" part. What does he mean by the center? Center of what? And what does the width of the curve from said center tell us? How does it confirm our findings?

3) "We also expect an actual Nh to deviate from N/2 by about √N/2, or the fraction to deviate by 1/N*√N/2=1/2√N"
Again, how did he come to that conclusion? It seems that he divided the deviation by N because Nh is also divided by N to reach the fraction. But does dividing the expected result of an experiment by a constant also mean that the deviation will be divided by said constant?

4) "Let us define P(x,Δx) as the probability that D will lie in the interval Δx located at x (say from x to x+Δx). We expect that for small Δx the chance of D landing in the interval is proportional to Δx, the width of the interval. So we can write P(x,Δx)=p(x)Δx."
The 2nd sentence confuses me. I understand that the chance of D landing in an interval will grow proportionally to the size of the interval: the bigger the interval, the bigger the chance. But I do not understand the meaning of the last equation: the probability of D being in the interval Δx = the probability of x times the interval? I know p(x) is the probability density function, but I don't understand how he derived it from this equation.

5) "We plot p(x) for three values of N in Fig. 6–7. You will notice that the “half-widths” (typical spread from x=0) of these curves is √N, as we have shown it should be."
This is probably a result of me not understanding what I asked in question 2, but what does he mean by the "typical spread from x=0"? For example, for N=10,000 the half-width of the curve is 100 only near the top, so how is it typical?

6) "You may notice also that the value of p(x) near zero is inversely proportional to √N." How can the value of p(x) near 0 be inversely proportional to √N if N is a constant? And if he means that the curve gets wider as N gets bigger, why is it inversely proportional to √N and not just N?
"Since the curves in Fig. 6–7 get wider in proportion to √N, their heights must be proportional to 1/√N." Again, why not just N?

Well I asked more than I planned, so if you actually take the time to answer all these you have my deepest gratitude.

2. Jul 5, 2016

### micromass

There are a lot of questions. I will try to answer them one by one. So I suggest we discuss the first one and then we move on.
He didn't exactly tell us how to interpret the rms deviation, so I understand your trouble.

Let us first focus on something completely different to illustrate the concepts. Let's say I'm teaching a class of $5$ students. I measure the heights of the students in centimeters and I get the following numbers: $160$, $171$, $174$, $182$, $200$.

The first question I can ask is how tall is the average student. Clearly, this is just the average:

$$\frac{160 + 171 + 174 + 182 + 200}{5} = 177.4$$

First of all, notice that none of the students actually is that tall. So the interpretation of an average is difficult. Indeed, a famous joke says that the average human has one testicle, even though there are very few humans for which this actually is the case!

So how would we interpret the average? The idea comes from a game (as does most of probability). Player A (usually nature) chooses one of the $5$ students at random, and you must guess the height. If your guess is correct, you lose no money. If it is incorrect, you lose an amount that grows with how wrong you were; to make the mathematics come out cleanly, take the penalty to be the square of your error. That is, if you guess $160$ and the answer is $171$, then you lose $11^2 = 121$ dollars. What should you guess to minimize your loss in the long run? The mathematics tells us that you should choose the average. Choosing the average means that you will always lose money here, since no student is exactly that tall, but on average you will lose less than with any other guess.
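A quick numerical check of the guessing game, as a sketch only. It assumes the penalty is the square of the error, and the names `average_loss` and `candidates` are made up for this illustration. For comparison it also tries a penalty equal to the plain distance, where the best guess turns out to be the middle student (the median) rather than the average.

```python
heights = [160, 171, 174, 182, 200]

def average_loss(guess, penalty):
    """Average penalty over all students when we always answer `guess`."""
    return sum(penalty(h - guess) for h in heights) / len(heights)

# Candidate guesses 160.0, 160.1, ..., 200.0 (a grid chosen for illustration).
candidates = [g / 10 for g in range(1600, 2001)]

# Penalty = squared error: the best guess is the mean.
best_squared = min(candidates, key=lambda g: average_loss(g, lambda e: e * e))

# Penalty = plain distance: the best guess is the median.
best_absolute = min(candidates, key=lambda g: average_loss(g, abs))

print(best_squared, best_absolute)
```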

But clearly, not all students are equally tall. Some students are taller, some are shorter. So we can ask how much the students vary from the "ideal" height, i.e. the mean. Doing this, we get:

$$-17.4, -6.4, -3.4, 4.6, 22.6$$

So some students are shorter (negative), others are taller (positive). We can take the average of these numbers. The interpretation is that this is the average signed distance from the average. The average is zero. Check it. It makes a lot of sense too.

But we might be interested in how much variation there is between the students. So on average, how much do the students differ from the average? Clearly $0$ is not the answer we want. We only got $0$ because we took into account that some deviations are negative and some are positive. If we don't take this into account, then we get the following distances from the mean:

$$17.4, 6.4, 3.4, 4.6, 22.6$$

The average of this is $10.88$. So what this means is that on average, the students lie a distance of $10.88$ from the mean: some are closer than $10.88$, some are further away. So this gives us information on the spread of the students.

Mathematically however, it is much easier not to do the previous, but to take the square of the numbers, then to take the average, then to take the square root. This is the rms. So we first take the square of the numbers:

$$302.76, 40.96, 11.56 , 21.16 ,510.76$$

This also gets rid of the minus signs. Then we take the average, which is $177.44$. Then we take the square root of this, which gives us the rms of $13.32$. This is a number which has the same interpretation as the $10.88$ we obtained before, but which is mathematically more convenient. A very good analogy is that the distance between two points $(x_1,y_1)$ and $(x_2,y_2)$ is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$, where we do something similar: squaring, then adding, then taking the square root.
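The three kinds of "average deviation" for the five heights can be reproduced in a few lines. This is just a sketch of the arithmetic; the variable names are chosen for illustration.

```python
import math

heights = [160, 171, 174, 182, 200]

mean = sum(heights) / len(heights)
deviations = [h - mean for h in heights]

# Signed deviations cancel out, so their average is zero.
mean_signed = sum(deviations) / len(deviations)

# Dropping the signs gives the average distance from the mean.
mean_abs = sum(abs(d) for d in deviations) / len(deviations)

# rms: square, average, then take the square root.
rms = math.sqrt(sum(d * d for d in deviations) / len(deviations))

print(mean, mean_signed, mean_abs, rms)
```

Note that the rms comes out larger than the plain average distance here, because squaring gives extra weight to the students who are far from the mean.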

So what does all of that mean for the coin tossing game? The average when throwing $N$ coins is $N/2$ heads and $N/2$ tails. This means that we would expect on average to have $N/2$ heads. If we do this many times, then we will see many games with fewer heads and many games with more heads. In order to find out how far the typical game can be from the average, we again calculate some kind of spread by taking the rms: we compute the distances from the actual outcome to $N/2$, square them, average, and take the square root. This is what the $\sqrt{N}/2$ means. It means that, in the rms sense, the typical distance between the average and the outcome is that large. Sometimes the distance is smaller, sometimes it is larger.
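For the coin game this rms can be computed exactly rather than estimated, by weighting every possible number of heads with its binomial probability. The sketch below assumes a fair coin; the function names are invented for this example. It uses $D = N_H - N_T = 2N_H - N$, so that $N_H - N/2 = D/2$.

```python
import math
from fractions import Fraction

def mean_square_distance(n):
    """Exact average of D^2, where D = heads - tails after n fair tosses."""
    weighted = sum(math.comb(n, k) * (2 * k - n) ** 2 for k in range(n + 1))
    return Fraction(weighted, 2 ** n)

def rms_distance(n):
    """rms of D; the lecture's result says this is sqrt(n)."""
    return math.sqrt(mean_square_distance(n))

def rms_heads_deviation(n):
    """rms of N_H - n/2, which is half the rms of D because N_H - n/2 = D/2."""
    return rms_distance(n) / 2

for n in (30, 100):
    print(n, rms_distance(n), rms_heads_deviation(n))
```

So there is no conflict between "the average number of heads is $N/2$" and "the rms distance is $\sqrt{N}$": the first is an average of signed quantities, the second an average of squared ones.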

Last edited: Jul 6, 2016
3. Jul 5, 2016

### Stephen Tashi

The lecture can be found online at: http://www.feynmanlectures.caltech.edu/I_06.html

He relies on equations 6.9 and 6.10, which precede the passage you quoted (eqs. 6.11, 6.12).

Yes, it might be a paradox if we are using the common language definition of "expected". However, in probability theory, the "expected" value of a random variable has a technical definition. Feynman is speaking as a physicist, not as a pure mathematician. Don't expect his arguments to be precise logical demonstrations. Indeed, the mathematical theory of probability does not offer any guarantees about what events in a probability space "actually" happen. Probability theory deals only with "the probability of" events. In applying probability theory to a specific situation, people make arguments about what will actually occur, which are not supported by any mathematical reasoning. The way this is often done is to mix the use of the word "expected" in the common language sense (indicating our anticipation of what will actually happen) with the mathematical use of the word "expected".

In abstract discussions of the foundations of physics written by physicists, the authors sometimes explicitly state that they are discussing "physical probability". I don't know if there is a standard definition of "physical probability", but it seems to amount to the assumption that if you do enough independent trials where the probability of success is p then the actual fraction of successes can be guaranteed to be arbitrarily near p. In contrast, the mathematical "law of large numbers" doesn't offer such a guarantee. It modestly speaks of "the probability of" the fraction of number of successes being arbitrarily near p.

Using the mathematical definition of "expected", there is no paradox in speaking of both the "expected value of a random variable X" and "the expected value of the deviation of the random variable X from its expected value".

Since the rms value in Feynman's discussion is some constant (for a given N) we could also speak of a distribution giving the probabilities that the actual rms value computed from a sample deviates from that constant by various amounts.

I think so - if you are using "average" in the mathematical sense of the word to denote the mathematical definition of "expected value".

I think he means X = 15 is the "center" of the x values.

I don't know why he thinks it is obvious that the "width" of the curve in fig 6-2 is about 3. Some people would say the "width" is the span from x = 5 to x = 25.

For the normal distribution, you can look at the two places where the graph changes from "concave up" to "concave down" and define the interval between those two places as the "width". For the graph of the binomial distribution in fig. 6-2, the location of such places isn't visually obvious to me.

4. Jul 6, 2016

First off, thank you for the great detailed replies, I really appreciate it.

Micromass: your explanation was great and I think I know why I was confused. By expected value, I thought he was saying that in a single game, the outcome with the highest chance of occurring is N/2, which is true, but not what he meant. From what I gathered, by "expected value" he meant that after a large number of games, we would expect the arithmetic mean of the number of heads to be N/2.

However, like you explained, the arithmetic mean can be misleading, since some outcomes will be more than the mean and some will be less. He proved before that the expected value of the rms of the distance traveled (AKA the difference between the number of heads and tails) is √N. Meaning that for all games we would expect the difference between heads and tails to be √N (and so the difference between heads and N/2 is √N/2), but if we take the average of a large number of games, the difference between the mean of the heads and the mean of the tails will be 0, since the expected value is N/2.

At least this is what I gathered from your explanation; it makes more sense now. I would love to hear your explanation for the rest of the questions.

Last edited: Jul 6, 2016
5. Jul 6, 2016

Yes, that's where I am reading it. BTW, this lecture was not given by Feynman; he was called out of the city at the time, which might explain the lack of detailed explanations.

I still don't get what he meant by the "width of the curve from the center" part. The curve he is showing seems to change from concave up to concave down around y=7.5, which seems to be where the width from the center is about 3. But how that proves to us that the deviation from the expected outcome is √N/2 is what I don't understand.

Last edited: Jul 6, 2016
6. Jul 6, 2016

### micromass

Answer to $2$:

Right, this is stated very badly. It's too imprecise to really make sense of it. But I'll explain what he should have said. The $D_{rms}$ indeed specifies something about the curve. Here are curves with the same center but different $D_{rms}$ values:

So the $D_{rms}$ determines how peaked the curve is. If you read my answer to question (1) again this makes sense since it determines the average distance from the mean. If this average distance is low, then points will lie very close to the mean and the curve is very peaked. If the distance is high, then the points will be very dispersed.

I can quantify things a bit more:

This is a very famous result for the bell-shaped (normal) curve: it tells you what fraction of the observations you should expect closer than one rms deviation to the mean, and the answer is about $68\%$. Closer than two rms deviations should lie about $95\%$ of the observations, and so on. In figure 6.2 the rms deviation of $k$ is $2.8$, so this means that about $68\%$ of the observations should lie within $2.8$ units of $15$, which visually seems plausible. About $95\%$ of the observations should lie within $5.6$ units of $15$; this also seems very likely to be satisfied. A more exact count should be done to see whether this is satisfied, but it looks ok to me.
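The "more exact count" can be sketched for the ideal case of $30$ fair tosses, using exact binomial probabilities instead of the observed data in the figure (which is not reproduced here). The helper name `prob_within` is made up for this example. Because the number of heads is a whole number, the fractions only need to come out close to $68\%$ and $95\%$, not exactly equal.

```python
import math
from fractions import Fraction

def prob_within(n, radius):
    """Exact probability that the number of heads in n fair tosses
    lies within `radius` of n/2."""
    favorable = sum(
        math.comb(n, k) for k in range(n + 1) if abs(k - n / 2) <= radius
    )
    return Fraction(favorable, 2 ** n)

n = 30
rms = math.sqrt(n) / 2  # about 2.7, the "2.8 units from 15" in the lecture

within_one = float(prob_within(n, rms))      # outcomes 13..17
within_two = float(prob_within(n, 2 * rms))  # outcomes 10..20

print(within_one, within_two)
```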

7. Jul 7, 2016

Well this also answers my 2nd question pretty much perfectly. 4 more to go.

8. Jul 7, 2016

### micromass

Yes, it does. Note that we always have $\left\langle A\right\rangle + \left\langle B\right\rangle = \left\langle A+B\right\rangle$ and $\alpha\left\langle A\right\rangle = \left\langle\alpha A\right\rangle$ for $\alpha\in \mathbb{R}$.
Then
$$\left\langle\left(\frac{N_H}{N} - \frac{1}{2}\right)^2\right\rangle = \left\langle\frac{1}{N^2}\left(N_H -\frac{N}{2}\right)^2\right\rangle = \frac{1}{N^2}\left\langle\left(N_H - \frac{N}{2}\right)^2\right\rangle$$
So
$$\left(\frac{N_H}{N} - \frac{1}{2}\right)_{rms} = \sqrt{\left\langle\left(\frac{N_H}{N} - \frac{1}{2}\right)^2\right\rangle } = \frac{1}{N} \sqrt{\left\langle\left(N_H - \frac{N}{2}\right)^2\right\rangle} = \frac{1}{N}\left(N_H - \frac{N}{2}\right)_{rms} = \frac{1}{N} \frac{\sqrt{N}}{2}$$

9. Jul 7, 2016

Fair enough.
I also have a question regarding the answer to question #1: we know that the expected value of the number of heads is N/2, and the expected value for the deviation is √N/2. Does it mean that for a single game of, let's say, 100 tosses, we expect the number of heads to be 50 but also expect the actual answer to deviate by 5? Meaning that if we were forced to pick a single outcome with the highest chance we would pick 50, but if we were not we would pick 50 +/- 5? Yet if that is the case, there is a higher chance of getting 50 +/- 1, since there is a higher probability of getting 49/51 than 45/55, so how come for a single game we still expect the answer to deviate by 5 and not 1?

10. Jul 7, 2016

### micromass

No, if you're not allowed to pick $50$, then $51$ and $49$ have the highest chance.

As an example, I did an experiment where I tossed a coin $100$ times and recorded the number of heads. Then I repeated this experiment ten thousand times (the counts below add up to $10000$). Here are the results:

$$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Heads} & 34 & 35 & 36 & 37 & 38 & 39 & 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47 & 48 & 49 & 50 & 51 & 52 & 53\\ \hline \text{Count} & 5 & 10 & 17 & 26 & 45 & 76 & 128 & 153 & 219 & 306 & 349 & 467 & 574 & 661 & 749 & 814 & 811 & 801 & 732 & 662\\ \hline \hline \text{Heads} & 54 & 55 & 56 & 57 & 58 & 59 & 60 & 61 & 62 & 63 & 64 & 65 & 66 & 68\\ \hline \text{Count} & 589 & 499 & 376 & 325 & 174 & 148 & 114 & 67 & 49 & 29 & 12 & 7 & 5 & 1\\ \hline \end{array}$$

As you see, the most popular outcome is $49$, which happened $814$ times. After that, the most popular outcomes are $50$ and $51$. You can see from this table that, roughly, the closer an outcome is to the center (i.e. $50$), the more likely it is. This is characteristic of a symmetric, single-peaked distribution.

We can calculate the mean, which is $49.9817$. To compute the spread, we take for each outcome the distance to $50$. We get

$$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Distance from } 50 & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 & 16 & 18 \\ \hline \text{Count} & 811 & 1615 & 1481 & 1323 & 1163 & 966 & 725 & 631 & 393 & 301 & 242 & 143 & 94 & 55 & 29 & 17 & 10 & 1\\ \hline \end{array}$$

The most popular deviation from $50$ is $1$. This makes sense since we get contributions from both $49$ and $51$, which are very popular. The mean deviation, however, is $3.9323$. So the mistake you're making is assuming that the mean is equal to the most popular outcome. It's not. The mean can be interpreted as the bet you need to make in order to lose the least money in the long run if the game is as follows: you lose the square of the difference between the outcome and your bet. For example: if you bet $2$ and the actual outcome is $5$, you'll lose $3^2 = 9$ dollars.

In the above, you see that you'll need to bet $3.9323$ for the distance between the outcome and $50$. The most popular outcome is distance $1$, however.
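The numbers quoted from the table can be recomputed directly from the tallies. The sketch below copies the counts from the first table; the variable names are chosen for illustration. It also computes the rms deviation from $50$, which is the quantity to compare with $\sqrt{N}/2 = 5$.

```python
import math

# Number of heads observed (67 never occurred) and how often each was seen.
heads = list(range(34, 67)) + [68]
counts = [5, 10, 17, 26, 45, 76, 128, 153, 219, 306, 349, 467, 574, 661,
          749, 814, 811, 801, 732, 662, 589, 499, 376, 325, 174, 148, 114,
          67, 49, 29, 12, 7, 5, 1]

total = sum(counts)

# Most popular outcome (the mode).
mode = heads[counts.index(max(counts))]

# Average number of heads.
mean = sum(h * c for h, c in zip(heads, counts)) / total

# Average distance from 50, and rms distance from 50.
mean_abs_dev = sum(abs(h - 50) * c for h, c in zip(heads, counts)) / total
rms_dev = math.sqrt(sum((h - 50) ** 2 * c for h, c in zip(heads, counts)) / total)

print(total, mode, mean, mean_abs_dev, rms_dev)
```

The rms deviation lands close to $5$, while the average distance is nearer $4$ and the most popular distance is $1$: three different summaries of the same table.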

Let's illustrate this with a more extreme example. I have a jar with $1000$ balls, $700$ have the symbol $1$, $300$ have the symbol $-1000$. You draw a ball from the jar and this is the amount of money you'll win (where negative means you lose). Would you play this game? Of course not. But why not?

If you look at the most popular outcome, it is $1$, so in most cases you'll win money. But you're not interested in that, you'll be interested in how much money you'll win on average. This is computed as

$$\frac{700}{1000}\cdot 1 + \frac{300}{1000}\cdot (-1000) = -299.3$$

This is why you won't play the game.
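The same jar as a small sketch, using exact fractions so that no rounding creeps in. The dictionary layout (payout mapped to number of balls) is just one convenient way to write it down.

```python
from fractions import Fraction

# payout written on the ball -> number of such balls in the jar
jar = {1: 700, -1000: 300}
total_balls = sum(jar.values())

# Average winnings per draw: each payout weighted by its probability.
expected_winnings = sum(
    Fraction(count, total_balls) * payout for payout, count in jar.items()
)

# The single most likely payout (the mode).
most_likely_payout = max(jar, key=jar.get)

print(float(expected_winnings), most_likely_payout)
```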

So what you need to learn is that the "average" and the "most popular outcome" (called the mode) can be drastically different. In symmetric, single-peaked situations they are the same, however, and that is what is often very misleading.

11. Jul 7, 2016

Yeah, it's a bit confusing. But I think I'm starting to wrap my head around it. Again, thanks for taking the time to explain all this stuff. 3 questions left.

12. Jul 7, 2016

### micromass

It's an approximation. It's not exact. The idea is that the smaller $\Delta x$ is, the better $P(x,\Delta x) = p(x)\Delta x$ is as an approximation. But it will never be exactly true... unless $\Delta x$ is something that we call an infinitesimal number.
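To see the approximation improve as $\Delta x$ shrinks, here is a sketch that assumes $p(x)$ is a bell curve with $\sigma = \sqrt{N}$ and $N = 10{,}000$. The starting point $x = 50$ and the function names are arbitrary choices for this illustration. The exact probability of landing in the interval comes from the error function `math.erf`.

```python
import math

def gaussian_density(x, sigma):
    """Bell-curve probability density p(x) with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def interval_probability(x, dx, sigma):
    """Exact probability of landing in [x, x + dx] for the same bell curve."""
    def cdf(t):
        return 0.5 * (1 + math.erf(t / (sigma * math.sqrt(2))))
    return cdf(x + dx) - cdf(x)

sigma = 100.0  # assumed: sigma = sqrt(N) with N = 10,000
x = 50.0       # arbitrary starting point of the interval

relative_errors = []
for dx in (100.0, 10.0, 1.0, 0.1):
    exact = interval_probability(x, dx, sigma)
    approx = gaussian_density(x, sigma) * dx  # the p(x) * dx rule
    relative_errors.append(abs(approx - exact) / exact)
    print(dx, exact, approx, relative_errors[-1])
```

For the widest interval the rule is visibly off, and the relative error keeps dropping as the interval narrows.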

13. Jul 7, 2016

So this is basically his way of saying that we divide the area under the curve into rectangles of infinitesimal width, where the area of each rectangle (the infinitesimal base times the height, i.e. the value of p(x) there) is equal to the chance that D will land in that interval. So for a large ∆x, the chance of D landing there is the definite integral of the function over ∆x.

14. Jul 7, 2016

### micromass

Correct.

15. Jul 7, 2016

### micromass

What he said is pretty meaningless and even wrong. So I would ignore it and interpret the $\sqrt{N}$ as I detailed in my answer to question 2. It is clear from Fig. 6–7 that things behave like I said.

16. Jul 7, 2016

### micromass

And as for this: what he says is true, but it doesn't follow from anything he has said so far. It's something you can prove, but it's not as easy as he makes it seem.
In fact, for a normal curve with half-width $\sqrt{N}$, the exact height at $x = 0$ is $\frac{1}{\sqrt{2\pi N}}$.
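A sketch of why the height goes like $1/\sqrt{N}$ and not $1/N$, assuming only that the curves for different $N$ all have one common shape $f$ stretched horizontally by a factor $\sqrt{N}$ (the symbols $f$ and $u$ are introduced here for illustration). The total probability must be $1$ for every $N$:

$$p_N(x) = \frac{1}{\sqrt{N}}\, f\!\left(\frac{x}{\sqrt{N}}\right), \qquad \int_{-\infty}^{\infty} p_N(x)\, dx = \int_{-\infty}^{\infty} f(u)\, du = 1 \quad \left(u = \frac{x}{\sqrt{N}}\right)$$

So $p_N(0) = f(0)/\sqrt{N}$. The area is roughly width times height, the width grows like $\sqrt{N}$ (not like $N$), and so the height has to shrink like $1/\sqrt{N}$ to keep the area fixed.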

17. Jul 7, 2016

I see. Well thank you very much for taking the time to answer all these questions, I understand it a lot better now.

18. Jul 7, 2016

### Stephen Tashi

Take that as a good example. If you are ever scheduled to give a lecture explaining probability, it would be a good idea to be called out of the city at that time.

You aren't being consistent in your use of the word "expected". "Expected" in the mathematical sense does not mean "what we anticipate will happen" or even "the outcome that is most probable". In your example 50 heads is both the "expected" value and the "most probable" value, but there are examples where the "expected value" is not the most probable value. And if the "most probable" event has a small probability then human beings don't "expect" it to happen - in the sense of anticipating that it will actually happen.

There is nothing paradoxical about a set of events in a probability space having more (total) probability than the single event in the space that is the most probable. For example the event "not exactly 50 heads are thrown" is more probable than the event "exactly 50 heads are thrown".
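These probabilities can be worked out exactly for $100$ tosses of a fair coin. The sketch below uses binomial coefficients and exact fractions; the variable names are chosen for illustration.

```python
import math
from fractions import Fraction

tosses = 100
outcomes = 2 ** tosses  # number of equally likely head/tail sequences

# The single most probable outcome, and everything else combined.
p_exactly_50 = Fraction(math.comb(tosses, 50), outcomes)
p_not_50 = 1 - p_exactly_50

# Deviating by 1 (49 or 51 heads) versus deviating by 5 (45 or 55 heads).
p_49_or_51 = Fraction(2 * math.comb(tosses, 49), outcomes)
p_45_or_55 = Fraction(2 * math.comb(tosses, 45), outcomes)

print(float(p_exactly_50), float(p_not_50))
print(float(p_49_or_51), float(p_45_or_55))
```

Even the most probable single outcome is unlikely on its own, which is why "most probable" and "what we anticipate" have to be kept apart.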

If you interpret "expected value " to mean "what we anticipate" then it is paradoxical to say that we "expect" 50 heads but we also "expect" a deviation of 5 from 50. However, the mathematical definition of "expected value" has nothing to do with anticipating what will actually happen.

In fact, probability theory has nothing to do with what will actually happen. It makes no guarantees about what will actually happen. The usual way people begin their study of probability is to use the crutch of thinking of a probability as an observed frequency. For example, they think of a probability distribution as a histogram of data from some experiment that was actually performed. This wrong way of thinking is very useful, even in studying advanced topics in probability theory. Some people never need to get beyond the wrong way of thinking.

The wrong way of thinking leads to various paradoxes and confusions. For example, if you are thinking about a probability distribution as a set of observed data then it becomes conceptually confusing when you are introduced to various random variables that measure how observed data differs from the probability distribution that the data comes from. (e.g. chi-squared tests).