# Strong law of large numbers - help understanding it

1. Nov 26, 2009

### WillJ

NOTE: This is a long post. If you want, you can skip ahead and start reading where I write "Things get really juicy".

The strong law of large numbers:

$$P\left(\lim_{n\to\infty}\overline{X}_{n}=\mu\right)=1$$

The stuff in parentheses is an event. But what is the sample space that that event is drawn from?

Imagine that our experiment is to select a number at random from [0,1], and we repeat this experiment in multiple trials. If we just do the experiment once, the sample space is [0,1]. If we do it twice, the sample space (if I'm not mistaken) is [0,1]×[0,1]. After 4 times, it's [0,1]×[0,1]×[0,1]×[0,1]. And so on. As the number of trials approaches infinity, our sample space approaches an infinite-dimensional sample space. The strong law of large numbers (if I'm not mistaken) says that, in this infinite-dimensional sample space (each outcome in this space being a countably infinite sequence of numbers, each drawn from [0,1]), the subset of outcomes in which $$\lim_{n\to\infty}\overline{X}_{n}=\mu$$ is true has probability 1, and the subset of outcomes in which that statement is not true has probability 0.
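A small simulation can make this picture concrete. The sketch below is only an illustration, not a proof: it generates a finite prefix of one outcome (one sequence of draws from [0,1]) and tracks the running average, which should settle near μ = 0.5. The function name `running_means` and the seed are arbitrary choices made for this example.

```python
import random

def running_means(n_trials, seed=0):
    """Return the running averages of n_trials uniform draws (one simulated outcome)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    means = []
    for n in range(1, n_trials + 1):
        total += rng.random()    # one draw from [0, 1)
        means.append(total / n)  # empirical mean after n trials
    return means

means = running_means(100_000)
for n in (10, 1_000, 100_000):
    print(n, round(means[n - 1], 4))
```

Changing the seed selects a different outcome; the strong law says that, apart from a set of outcomes of probability 0, every such sequence of running averages converges to 0.5.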

Am I correct? Or is my thinking off?

It seems more common to think of an infinite sequence of iid random variables all defined on the same sample space. (Perhaps this is actually the correct interpretation, and my interpretation above is wrong.) Repeated trials of an experiment do not fit this description (at least not directly; see below). In that case, what makes things hard to comprehend is how we can have infinitely many iid r.v.s on the same sample space. The above example doesn't work (or maybe it does; see below). If our experiment is to select a number from [0,1], and we do this multiple times, and X_i is the number we select in the ith trial, and we force the X_i to all be on the same sample space (namely [0,1]), then for any $$s\in[0,1]$$, X_1(s)=s, X_2(s)=s, X_3(s)=s, and so on, i.e. the r.v.s are all the same and thus are dependent, and the SLOLN does not apply. So we have to think of r.v.s that are all different from each other, yet identically distributed, all on the same sample space ... and infinitely many of them!

This issue was mentioned in a discussion on Terence Tao's blog. Anonymous writes:

"In theory, we are speaking about a sequence X_1,X_2,... of random variables on the same common probability space. In practice, we think of instantiating a random variable in a sequence of trials. This is a subtle point: each trial is itself a random variable, distributed exactly as the original one, and living in an identical, but separate and independent, probability space. When we toss a die 100 times, there are 6^{100} possible outcomes. This means, we are no longer in our original six-element space, but in its cartesian power, with 6^{100} points. That is, tossing a die 100 times is described by the probability space which is the product of 100 copies of the space, corresponding to tossing just one die. So, our X_1,X_2,... actually live in different spaces.

How does this agree with the theoretical assumption that the X_n’s are defined on the same space? To make this more specific: say, what is the “practical meaning” of pointwise convergence in the Strong law? When we think of trials, the variables \overline{X}_n live in distinct spaces; on what space do they pointwise converge?"

Professor Tao responds:

"In elementary (finitary) probability theory, the sample space (or probability space) is often defined in textbooks as simply the set of all possible outcomes, as this is the simplest choice of space to work in for most finitary applications. When it comes to infinitary probability theory, though, it is better to take a more flexible and abstract viewpoint: the sample space is now allowed to be an abstract set, and each outcome corresponds to a separate event inside that set. For instance, if one is studying the flip of a single coin, the sample space \Omega could be a two-element set {Heads, Tails}, but it doesn’t have to have just two elements; it could be a much larger set, partitioned into two disjoint subsets, the “Heads” event and the “Tails” event. For instance, the sample space could be the unit interval [0,1] with Lebesgue measure, and the Heads and Tails events could be the intervals [0,1/2) and (1/2,1] respectively.

For the purposes of probability theory, the exact size of the sample space does not matter; the only thing that matters is the algebra (or more precisely, \sigma-algebra) of events and the probability (or measure) which is assigned to each event; the actual points in the sample space are in fact largely irrelevant to probability theory. (See also the notion of equivalence of measure spaces, as defined for instance in Lecture 11 of my 254A course. Equivalent probability spaces may have very different cardinalities, but are indistinguishable from each other for the purposes of doing probability theory.)"​

Unfortunately such talk is rather mysterious to me. He gets around to saying:

To construct a suitable sample space to hold an infinite collection of random variables, such as an infinite sequence of independent die rolls, one can take an inverse limit of the sample spaces associated to finite sequences of die rolls.​

Sounds promising, although what exactly does that mean? Can anyone elaborate?

He continues:

One could also resort to more ad hoc devices, such as taking the sample space to be the unit interval [0,1] with Lebesgue measure (which is the standard sample space for selecting a random variable x uniformly at random from the unit interval) and then defining the value X_i of the i^th die roll to be the i^th digit of x base 6, plus one. In this space, the law of large numbers has this interpretation: when one selects a number at random from the unit interval, then almost surely, its base 6 digits are uniformly distributed amongst 0,1,2,3,4,5.

Quite clever! And going back to the original example of selecting a number from [0,1], we could modify this into selecting a number from {0,1,...,9}, and doing independent trials of this, as represented by selecting a single number from [0,1], with the outcome of the ith trial being the ith decimal digit of the number selected. This seems nice, but also a little unsatisfactory to me, because it seems like we're taking the mysteriousness of the strong law of large numbers and explaining it by resorting to the mysteriousness of the real number system. Yes, I suppose it's true that, for almost every number in [0,1], the average of its first n digits goes to 4.5 as n goes to infinity, and, assuming the digits are independent of one another, that's what the SLOLN says should happen. But that's not helping me very much.
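The digit construction can be tried out numerically. The following is a sketch under one stated assumption: a float cannot hold thousands of digits, so the first n digits of a uniform x in [0,1) are represented exactly by the integer floor(x · base^n), which is uniform on {0, ..., base^n − 1}. The names `digits_of_uniform` and `n_digits` are made up for this example.

```python
import random
from collections import Counter

def digits_of_uniform(n_digits, base=10, seed=0):
    """First n_digits base-`base` digits of a simulated uniform x in [0, 1).

    Assumption: x is represented by k = floor(x * base**n_digits), which is
    uniform on {0, ..., base**n_digits - 1}.
    """
    rng = random.Random(seed)
    k = rng.randrange(base ** n_digits)
    digits = []
    for _ in range(n_digits):
        k, d = divmod(k, base)  # peel off the least significant digit
        digits.append(d)
    digits.reverse()            # most significant digit first
    return digits

digits = digits_of_uniform(6000)
print(sum(digits) / len(digits))         # empirical mean of the decimal digits
print(sorted(Counter(digits).items()))   # how often each digit occurred
```

Calling `digits_of_uniform(6000, base=6)` and adding one to each digit gives the die rolls from the quoted construction, whose average should then sit near 3.5 instead of 4.5.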

Addressing the perhaps unsatisfactory-ness of the above, Tao writes:

Returning specifically to the question of finitary interpretations of the SLLN, these basically have to do with the situation in which one is simultaneously considering multiple averages \overline{X}_n of a single series of empirical samples, as opposed to considering just a single such average (which is basically the situation covered by the WLLN). For instance, if one had some random intensity field of grayscale pixels, and wanted to compare the average intensities at 10 x 10 blocks, 100 x 100 blocks, and 1000 x 1000 blocks, then the SLLN suggests that these intensities would be likely to be simultaneously close to the average intensity. (The WLLN only suggests that each of these spatial averages are individually likely to be close to the average intensity, but does not preclude the possibility that when one considers multiple such spatial averages at once, that a few outlying spatial averages will deviate from the average intensity. In my example with only three different averages, there isn’t much difference here, as the union bound only loses a factor of three at most for the failure probability, but the SLLN begins to show its strength over the WLLN when one is considering a very large number of averages at once.) ​

Hmm, anyone have any thoughts on that? Sounds promising, but I don't fully understand.

Things get really juicy, though, when Tao writes (explaining the SLLN vs. the WLLN):

Imagine a table in which the rows are all the possible points in the sample space (this is a continuum of rows, but never mind this), and the columns are the number n of trials, and there is a check mark whenever the empirical mean \overline{X}_n deviates significantly from the actual mean {\Bbb E} X. The weak law asserts that the density of check marks in each column goes to zero as one moves off to the right. The strong law asserts that almost all of the rows have only finitely many checkmarks. A double counting argument (or the Lebesgue dominated convergence theorem) then shows that the latter implies the former.​
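A finite version of this table can be built by simulation. The sketch below is an illustration only: it uses uniform draws from [0,1] (so the actual mean is 0.5), a finite number of rows and columns, and a threshold `eps` standing in for "deviates significantly". All names and parameter values are assumptions of this example, and a finite table can only suggest, never verify, the "finitely many checkmarks" statement.

```python
import random

def checkmark_table(n_rows, n_cols, eps, seed=0):
    """table[r][n-1] is True when row r's empirical mean after n trials is off by more than eps."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_rows):          # each row is one outcome: a whole sequence of trials
        total = 0.0
        row = []
        for n in range(1, n_cols + 1):
            total += rng.random()
            row.append(abs(total / n - 0.5) > eps)
        table.append(row)
    return table

def column_density(table, n):
    """Fraction of rows with a check mark in column n (the weak-law quantity)."""
    return sum(row[n - 1] for row in table) / len(table)

def last_checkmark(row):
    """Column of the last check mark in a row, or 0 if the row has none."""
    marks = [n for n, marked in enumerate(row, start=1) if marked]
    return marks[-1] if marks else 0

table = checkmark_table(n_rows=200, n_cols=2000, eps=0.05)
for n in (10, 100, 1000, 2000):
    print("column", n, "density", column_density(table, n))
print("latest check mark in any row:", max(last_checkmark(row) for row in table))
```

Reading down a column gives the weak-law picture (the density shrinks as n grows), while reading along a row gives the strong-law picture (each simulated row stops collecting check marks after some point).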

I think that if I manage to understand this, then I'll understand the SLLN in general. But I need some help. I wrote on his blog:

I don’t understand. Each row represents an outcome in the sample space. Each column represents a trial of the experiment. Right? (We assume, for purposes of the illustration, that our sequence of iid random variables is as follows: for any given outcome, all of the r.v.s assign the same number to that outcome; the different r.v.s correspond to the different trials. – Right?) Now where do we put the checkmarks? I can think of two possibilities of what you mean:

1) We imagine repeatedly running the experiment. Each trial gives us an outcome. If trial i gives us outcome j, then put a dot in (row j, column i) in our matrix. Now we take a look at what we have, and at each dot that we put in our matrix, we calculate the empirical mean up to that point (based on where that dot is and where the previous dots are), and if it’s significantly different from the actual mean, then we put a check-mark over that dot. If not, leave the dot be. (We don’t have to ever actually put dots; I just mentioned them for clarification.) But then a column can’t have more than one check-mark, so that must not be what you’re saying.

Perhaps, then, you mean: 2) For EVERY cell in our matrix (every [row j, column i] combination), we take a look, do the calculations, and put a check-mark in that cell if, at that point, the empirical mean is significantly different from the actual mean. But that doesn’t make any sense, because the value of the empirical mean at any given cell depends on what the outcomes were in the previous trials, i.e. there’s no unique empirical mean that can be assigned to each cell in our matrix.

Can any of you guys help me out here?

Last edited by a moderator: May 4, 2017

2. Nov 27, 2009

### bpet

There are a few important points you seem to be missing in all this.

Well, it is a possible outcome, but it almost never happens. As Tao explained, *almost all* rows don't look like this.

You need to be clear about whether an experiment/outcome is a single die roll or an infinite sequence of die rolls. In either case you seem to be misinterpreting the table. For his example of real number digit expansions, the (j,i)th entry would be the i'th digit of j, where 0<j<1.

More importantly, after a finite number of trials you can't know in which specific row to place the dot, only which subset of rows it could be in (those for which the first i entries agree with your first i outcomes). This relates to his earlier point about sigma-algebras; you'll need to study measure theory to understand the subtleties of this.

For any position on the table, the empirical mean is well-defined because all the required information is available in the cells left of it.
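These two points can be made concrete with the base 6 digit table. The sketch below is only an illustration with made-up names: given the first few observed digits, `consistent_rows` returns the interval of rows j in [0,1) whose expansions start with those digits, and `empirical_mean` computes the mean of the corresponding die rolls (digit plus one) using nothing but the cells to the left.

```python
from fractions import Fraction

def consistent_rows(digits, base=6):
    """Interval [lo, hi) of rows j whose base-`base` expansion starts with `digits`."""
    lo = Fraction(0)
    width = Fraction(1)
    for d in digits:
        width /= base    # each observed digit narrows the set of rows by a factor of `base`
        lo += d * width
    return lo, lo + width

def empirical_mean(digits):
    """Mean of the die rolls (digit + 1) seen so far; depends only on the observed digits."""
    rolls = [d + 1 for d in digits]
    return Fraction(sum(rolls), len(rolls))

observed = [2, 0, 5]  # hypothetical first three base 6 digits, i.e. die rolls 3, 1, 6
lo, hi = consistent_rows(observed)
print("rows consistent with the data:", lo, "<= j <", hi)
print("probability of that set of rows:", hi - lo)
print("empirical mean so far:", empirical_mean(observed))
```

After i trials the consistent rows form an interval of length 6^(-i), which shrinks toward a single row only in the limit, yet the empirical mean is the same for every row in that interval.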

Hope this helps clarify things :)

3. Nov 28, 2009

### WillJ

Thanks for the help, bpet. Fortunately, a little while after posting this, I came to see how I was looking at it wrong, and now everything is clear to me. :) Specifically, now I understand it's best to think of running infinitely many trials of an experiment as one big experiment in itself, with each outcome in the sample space specifying the results of the infinitely many trials. Then the table thing makes perfect sense, as do the WLOLN and SLOLN in general. (One can, I suppose, alternatively imagine the infinite-trial experiment as having an infinite-dimensional sample space, which is what I was doing before, but, as I can attest to, that's more taxing on the brain and can trip you up.)