Simple problems regarding sum of IID random variables

In summary, the thread models daily weight change as a sum of IID per-marble gains and losses. The main questions are what fraction of days show a net weight gain, why summing n independent copies of a random variable differs from scaling a single copy by n, and how quickly the distribution of the sum converges to a Gaussian.
  • #1
madilyn
Hi! I'm taking my first course in statistics and am hoping to get some intuition for this set of problems...

Suppose I have a bowl of marbles that each weighs [itex]m_{marble}=0.01[/itex] kg.

For each marble I swallow, there is a chance [itex]p=0.53[/itex] that it adds [itex]m_{marble}[/itex] to my weight, and chance [itex]1-p[/itex] that it causes me to puke, therefore losing [itex]m_{puke}=0.011[/itex] kg of my weight.

1. Assume I religiously swallow [itex]n=10^{4}[/itex] marbles each day. What fraction of the days do I expect to gain weight on?

Let [itex]X_{i}[/itex] denote the random variable for the weight I gain from the [itex]i[/itex]-th swallowed marble, indexed by [itex]i\in\mathbb{Z}^{+}[/itex].

Let [itex]Y[/itex] denote the random variable for my total weight gained each day from swallowing [itex]n[/itex] marbles, [itex]Y=\sum_{i=1}^{n}X_{i}[/itex]. Then, denote
[tex]E\left(X\right) := E\left(X_{1}\right)=E\left(X_{2}\right)=...=E\left(X_{n}\right)[/tex]
[tex]Var\left(X\right) := Var\left(X_{1}\right)=Var\left(X_{2}\right)=...=Var\left(X_{n}\right)[/tex]
such that the theoretical distribution of my daily weight gain is approximately normal with mean
[tex]E\left(Y\right)=E\left(X_{1}+...+X_{n}\right)=nE\left(X\right)[/tex]
and variance
[tex]Var\left(Y\right)=Var\left(X_{1}+...+X_{n}\right)=nVar\left(X\right)[/tex]
Then, I expect to gain weight on [itex]1-P\left(Y\leq0\right)\approx0.892[/itex] of the days. Is this correct?
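
(A minimal Python sketch of this calculation, assuming NumPy and SciPy are available; the binomial simulation at the end is just a sanity check of the normal approximation, not part of the reasoning above.)

[code]
import numpy as np
from scipy import stats

p, m_gain, m_puke, n = 0.53, 0.01, 0.011, 10**4

# Per-marble weight change X: +m_gain with probability p, else -m_puke.
EX = p * m_gain - (1 - p) * m_puke
VarX = p * m_gain**2 + (1 - p) * m_puke**2 - EX**2

# Normal approximation to Y = X_1 + ... + X_n.
mu, sigma = n * EX, np.sqrt(n * VarX)
print("normal approximation:", 1 - stats.norm.cdf(0, loc=mu, scale=sigma))  # ~0.892

# Monte Carlo check: each day, the number of marbles kept is Binomial(n, p).
rng = np.random.default_rng(0)
kept = rng.binomial(n, p, size=200_000)
gains = kept * m_gain - (n - kept) * m_puke
print("simulated fraction:  ", (gains > 0).mean())  # also ~0.89
[/code]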

2. Why does [itex]Y[/itex] approximately follow a normal distribution with mean [itex]nE\left(X\right)[/itex] and standard deviation [itex]\sqrt{nVar\left(X\right)}[/itex]?

Firstly, am I correct that the variance is [itex]nVar\left(X\right)[/itex] and not [itex]n^{2}Var\left(X\right)[/itex]? Can someone remind me of the intuitive difference between the random variable [itex]Y=X_{1}+...+X_{n}[/itex] and [itex]Y=50X[/itex]?

Secondly, it's not immediately obvious to me how the distribution approaches a Gaussian as [itex]n\rightarrow\infty[/itex]. Perhaps I can formulate this in terms of the convolution of a discrete function representing the distribution of my weight gain/loss for each marble swallowed? Will the discrete convolution approach generalize nicely to sums of other discrete random variables?
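
(A rough sketch of that convolution idea, assuming NumPy. Since the daily weight change is an affine function of the number of marbles kept, it is enough to convolve the Bernoulli "kept the marble?" PMF with itself; the same np.convolve loop works for any discrete PMF on an evenly spaced grid.)

[code]
import numpy as np

p, n = 0.53, 50

# PMF of a single Bernoulli "kept the marble?" indicator on {0, 1}.
step = np.array([1 - p, p])

# Convolving n copies gives the PMF of the number of marbles kept in a day.
pmf = step.copy()
for _ in range(n - 1):
    pmf = np.convolve(pmf, step)

# Compare with the Gaussian density of matching mean and variance.
k = np.arange(n + 1)
mu, var = n * p, n * p * (1 - p)
gauss = np.exp(-(k - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
print(np.abs(pmf - gauss).max())  # shrinks as n grows
[/code]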

3. For finite [itex]n[/itex], am I correct that this distribution converges faster to a Gaussian distribution near the center and slower in the tails as [itex]n[/itex] increases? Can I quantify this rate of convergence?

Just from my intuition, I think the best strategy to attack this problem is to express the distribution of the sum [itex]Y=\sum_{i=1}^{n}X_{i}[/itex] via its Fourier transform (characteristic function) and then investigate the rate of convergence using an asymptotic expansion of the inversion integral for large [itex]n[/itex], i.e. the saddle point method?
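
(As a first numerical step in that direction, a sketch assuming NumPy: the characteristic function of the sum is the single-marble characteristic function raised to the n-th power, and it can be compared directly with the Gaussian characteristic function of matching mean and variance. This is only the ingredient for a saddle-point or Edgeworth analysis, not the analysis itself.)

[code]
import numpy as np

p, a, b, n = 0.53, 0.01, 0.011, 10**4

EX = p * a - (1 - p) * b
VarX = p * a**2 + (1 - p) * b**2 - EX**2

# Grid of t-values on the scale where the characteristic functions are non-negligible.
t = np.linspace(-5.0, 5.0, 2001) / np.sqrt(n * VarX)

# Characteristic function of one marble, and of the sum of n marbles.
phi_X = p * np.exp(1j * t * a) + (1 - p) * np.exp(-1j * t * b)
phi_Y = phi_X ** n

# Gaussian characteristic function with the same mean and variance.
phi_G = np.exp(1j * t * n * EX - 0.5 * n * VarX * t**2)

print(np.abs(phi_Y - phi_G).max())  # shrinks roughly like 1/sqrt(n)
[/code]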

Thanks!
 
  • #2
madilyn said:
Hi! I'm taking my first course in statistics and am hoping to get some intuition for this set of problems...

This thread will probably be moved to the homework section. (I think the math homework section would be easier to find as a "math homework" subsection of mathematics, so I sympathize with the misplacement.) For homework, you should state the problem itself rather than leave the helpers to guess it from your reasoning.
Then, I expect to gain weight on [itex]1-P\left(Y\leq0\right)\approx0.892[/itex] of the days. Is this correct?

The question of "did gain" vs. "did not gain" is a different question than "how much" is gained. "Did gain" can be represented by a random variable that only takes on the values 1 or 0 (a Bernoulli random variable). The theorems about expectation and variance apply to such a random variable, but you'd get different answers than you get from answering questions about "how much".
Can someone refresh me what's the intuitive difference between the random variable [itex]Y=X_{1}+...+X_{n}[/itex] as compared to [itex]Y=50X[/itex] again?

Think of a computer simulation. The algorithm for simulating 50X is to make one random determination of X and then multiply it by 50. The algorithm for simulating the sum of 50 different realizations of X is to make 50 random determinations of X and add them. When you make 50 different random determinations, opposite extremes have the chance to "cancel out". You don't get that with a simulation of 50X, since you only make one random determination of X.
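
(A minimal simulation along these lines, assuming NumPy and reusing the marble weights from post #1, makes the difference in spread visible.)

[code]
import numpy as np

p, m_gain, m_puke = 0.53, 0.01, 0.011
rng = np.random.default_rng(0)
trials = 100_000

# One random determination of X per trial, multiplied by 50.
x_once = np.where(rng.random(trials) < p, m_gain, -m_puke)
scaled = 50 * x_once

# Fifty independent determinations of X per trial, added together.
x_many = np.where(rng.random((trials, 50)) < p, m_gain, -m_puke)
summed = x_many.sum(axis=1)

print(scaled.var(), summed.var())  # roughly 2500*Var(X) vs. 50*Var(X)
[/code]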
3. For finite [itex]n[/itex], am I correct that this distribution converges faster to a Gaussian distribution near the center and slower in the tails as [itex]n[/itex] increases? Can I quantify this rate of convergence?

That's a good question! I'd guess there are several ways, but I don't know them. One thing to do is to understand the distinction between "pointwise convergence" and "uniform convergence". Convergence of a sequence of functions to another function is more complicated than convergence of a function evaluated at a sequence of points to a single number. There are several different definitions of "convergence" when we deal with sequences of functions converging to a single function.
 
  • #3
Stephen Tashi said:
This thread will probably be moved to the homework section. (I think the math homework section would be easier to find as a "math homework" subsection of mathematics, so I sympathize with the misplacement.) For homework, you should state the problem itself rather than leave the helpers to guess it from your reasoning.

Oh sorry, I came up with the questions myself so it isn't a homework problem.
Stephen Tashi said:
The question of "did gain" vs. "did not gain" is a different question than "how much" is gained. "Did gain" can be represented by a random variable that only takes on the values 1 or 0 (a Bernoulli random variable). The theorems about expectation and variance apply to such a random variable, but you'd get different answers than you get from answering questions about "how much".

I see. What would you intuitively interpret the 89.2% figure as if not the fraction of days I expect to have positive weight gain?

Stephen Tashi said:
Think of a computer simulation. The algorithm for simulating 50X is to make one random determination of X and then multiply it by 50. The algorithm for simulating the sum of 50 different realizations of X is to make 50 random determinations of X and add them. When you make 50 different random determinations, opposite extremes have the chance to "cancel out". You don't get that with a simulation of 50X, since you only make one random determination of X.

Ah, this made a lot of sense! Thanks!

Stephen Tashi said:
That's a good question! I'd guess there are several ways, but I don't know them. One thing to do is to understand the distinction between "pointwise convergence" and "uniform convergence". Convergence of a sequence of functions to another function is more complicated than convergence of a function evaluated at a sequence of points to a single number. There are several different definitions of "convergence" when we deal with sequences of functions converging to a single function.

Yes, I figured this is a difficult problem, but also one of the more interesting ones I've thought of while learning statistics. What's a good starting point for me to learn about the different definitions of "convergence" that you mentioned in your last sentence?

Thanks so much!
 
  • #4
madilyn said:
3. For finite [itex]n[/itex], am I correct that this distribution converges faster to a Gaussian distribution near the center and slower in the tails as [itex]n[/itex] increases? Can I quantify this rate of convergence?
A couple of notes. (Talking informally in terms of X=sum of events, not divided by n.)

1) You are comparing a discrete distribution (a PMF) with a continuous PDF, so the two will always differ significantly at values of X that the discrete distribution cannot take. You can handle this several ways. I think the best way is to look at convergence of the CDFs instead of the densities. The CDF of a discrete real random variable extends unambiguously to all real values.

2) You know that the discrete CDF is 1.0 for all values of X > n, and n + epsilon is where the normal distribution's CDF is farthest from 1.0. I suspect that this is where the greatest difference between the two CDFs is, and that it gives the rate of uniform convergence. I cannot prove it without some work.
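
(A small sketch of that CDF comparison, assuming NumPy and SciPy, again in terms of the number of marbles kept, which determines the daily sum. It evaluates the largest gap between the exact binomial CDF and the matching normal CDF at the jump points and reports where it occurs.)

[code]
import numpy as np
from scipy import stats

p = 0.53
for n in (10, 100, 1000, 10_000):
    k = np.arange(n + 1)
    exact = stats.binom.cdf(k, n, p)
    approx = stats.norm.cdf(k, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
    gap = np.abs(exact - approx)
    # Max gap shrinks roughly like 1/sqrt(n); the last number is where it occurs.
    print(n, gap.max(), k[gap.argmax()])
[/code]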
 
  • #5
Hello there,

It's great to hear that you're taking your first course in statistics and are already thinking deeply about these problems! Let me try to provide some insights and answers to your questions.

1. Your calculation of the probability of gaining weight on a given day looks correct. Keep in mind, though, that it rests on the normal approximation: the exact distribution of your daily total is discrete, so the 0.892 figure is an approximation rather than an exact value. It's a good starting point, but it's worth remembering the limitations of that assumption.

2. The variance of Y is indeed nVar(X) and not n^2Var(X). When we add independent random variables, their variances add, so the sum of n independent copies has variance n times the variance of a single copy. The difference between Y = X_1 + ... + X_n and Y = 50X is that the first is a sum of n independent realizations of X, each contributing its own randomness, while the second is a single realization of X scaled by 50; scaling multiplies the mean by 50 but the variance by 50^2 = 2500.

The reason Y approximately follows a normal distribution as n increases is the Central Limit Theorem: as the sample size n grows, the distribution of the standardized sum (equivalently, the sample mean) approaches a normal distribution regardless of the distribution of the individual values. Here Y is the sum of n IID random variables, which is just n times their sample mean.

3. It is correct that the distribution converges to a Gaussian faster near the center and more slowly in the tails. The rate of convergence depends on the underlying distribution of the individual values: if they are themselves normally distributed, the sum is exactly normal for every n, while if they have heavy tails (a higher probability of extreme values), convergence is slower.

As for quantifying the rate of convergence, it can be done using methods such as the Berry-Esseen theorem or the Edgeworth expansion. These give explicit bounds and correction terms for the error in the normal approximation as a function of n and the moments of the underlying distribution.
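
(For the marble numbers, a rough sketch of what the Berry-Esseen bound gives, assuming NumPy. The classical statement is [itex]\sup_{x}\left|F_{n}\left(x\right)-\Phi\left(x\right)\right|\leq C\rho/\left(\sigma^{3}\sqrt{n}\right)[/itex] with [itex]\rho=E\left|X-E\left(X\right)\right|^{3}[/itex]; the constant C = 0.4748 used below is one published value and should be treated as an assumption of this sketch.)

[code]
import numpy as np

p, a, b, n = 0.53, 0.01, 0.011, 10**4

EX = p * a - (1 - p) * b
sigma = np.sqrt(p * (a - EX) ** 2 + (1 - p) * (-b - EX) ** 2)
rho = p * abs(a - EX) ** 3 + (1 - p) * abs(-b - EX) ** 3  # third absolute central moment

C = 0.4748  # one published value of the Berry-Esseen constant (assumed here)
print(C * rho / (sigma**3 * np.sqrt(n)))  # uniform bound on the CDF error, about 0.005
[/code]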
 

1. What are IID random variables?

IID stands for Independent and Identically Distributed. In simple terms, it means the random variables are independent of one another and all follow the same probability distribution: no variable affects any other, and each has the same distribution of possible values.

2. What is the formula for calculating the sum of IID random variables?

The formula for calculating the sum of IID random variables is
[tex]S = X_{1} + X_{2} + X_{3} + \dots + X_{n}[/tex]
where [itex]X_{1}, X_{2}, X_{3}, \dots, X_{n}[/itex] are the individual random variables.

3. How do you determine the mean of the sum of IID random variables?

The mean of the sum of IID random variables can be determined by taking the sum of the individual means of the random variables. In other words, the mean of the sum is equal to the sum of the means.
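
For instance, with the marble numbers from the thread ([itex]p=0.53[/itex], gain [itex]0.01[/itex] kg, loss [itex]0.011[/itex] kg, [itex]n=10^{4}[/itex]):
[tex]E\left(Y\right)=\sum_{i=1}^{n}E\left(X_{i}\right)=n\left[p\,m_{marble}-\left(1-p\right)m_{puke}\right]=10^{4}\left[0.53\left(0.01\right)-0.47\left(0.011\right)\right]=1.3\text{ kg}[/tex]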

4. What is the central limit theorem and how does it relate to sum of IID random variables?

The central limit theorem states that the sum of a large number of independent random variables is approximately normally distributed, even if the individual random variables are not. This means that as the number of IID random variables increases, the distribution of their sum approaches a normal distribution, which makes it easier to analyze and make predictions.

5. What are some real-world applications of sum of IID random variables?

The sum of IID random variables is commonly used in statistics, finance, and risk analysis. For example, it can be used to calculate the expected value and variance in stock prices, or to analyze the risk of a portfolio by considering the sum of returns from multiple investments. It is also used in quality control to determine the total variation in a manufacturing process.
