- #1
techmologist
- 306
- 12
Anybody ever heard of the broken stick model? If so, what is it used for besides species abundance in ecology? Basically, n-1 points are randomly distributed on a unit interval, creating n intervals of random lengths. The points are independent, uniform random variables on the unit interval.
Here I am going to derive the expected length of Ir, the rth shortest interval using a brute force approach. The I’s are the order statistics for the intervals. That is, 0<=I1<= I2<= … Ir<= … In<= 1. Looking around online, there are some more sophisticated derivations for the expectation that I wasn't able to follow.
First I'll give a brief sketch of the my derivation. I'll fill in the details later.
To begin, we need to find the probability that Ir > x, or P(Ir > x). This condition holds exactly when there are at least n-r+1 intervals with length greater than x.
As a sub-problem, we need to find the probability that exactly j intervals have length > x, and then we can sum over j from n-r+1 to n. The set of n-1 points X1, X2, … X(n-1) can be thought of as a vector uniformly distributed on an (n-1)-cube of unit size. It turns out that the easiest probability to calculate is the probability that a particular group of s intervals (and possibly others, too) have lengths > x. This probability is just (1-sx)n-1, provided s is less than the integer part of 1/x. Otherwise it’s zero.
For example, the probability that the first two intervals starting from the left are greater than .01 in length is (1-2*.01)n-1 Because of the symmetry of the problem, all such groups have the same probability, so the sum over all groups of s intervals of the probability that at least those s intervals have length > x is given by n_C_s (1-sx)n-1, provided s < floor(1/x). You might think we could immediately get the probability that at least n-r+1 intervals have length > x by putting s=n-r+1. But the cases where there are more than s intervals with length > x are being counted multiple times, so we have to use the inclusion-exclusion principle. The probability that exactly j intervals have length > x is
1) [tex]\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex] if s<floor(1/x). Zero otherwise. In other words,
2) [tex]\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex]
Therefore the tail probability distribution for the r-th smallest interval is given by
3) [tex]P(I_r > x) = \sum_{j=n-r+1}^{n}\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1 }[/tex]
To find the expected value of Ir, we integrate this tail probability over the allowable values for x. The smallest x could be is zero. The largest it could be is 1/(n-r+1). This would happen if the smallest r-1 intervals were zero length and all other intervals had length 1/(n-r+1). So we need to integrate P(I_r>x) from 0 to 1/(n-r+1).
Because the upper limit of the sum over s depends on x, we have to break the integral up into small intervals:
4) [tex]E[I_r] = \int_{0}^{\frac{1}{n}}P(I_r > x)dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}P(I_r>x)dx[/tex]
Plugging in (3) for P(I_r>x) we get two main terms for E[I_r].
5) [tex]E[I_r] = \int_{0}^{\frac{1}{n}} \sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx[/tex]
After performing the integrals and rearranging the summations, you get to this:
6) [tex]E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n} \sum_{s=j}^{n}(-1)^{s-j} \frac{ \binom{s}{j} \binom{n}{s} }{s}[/tex]
That inner sum over s had me stumped for over a week, but I finally figured out how it comes out to 1/j. That gives the desired answer
7) [tex]E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n}\frac{1}{j}[/tex]
Here I am going to derive the expected length of Ir, the rth shortest interval using a brute force approach. The I’s are the order statistics for the intervals. That is, 0<=I1<= I2<= … Ir<= … In<= 1. Looking around online, there are some more sophisticated derivations for the expectation that I wasn't able to follow.
First I'll give a brief sketch of the my derivation. I'll fill in the details later.
To begin, we need to find the probability that Ir > x, or P(Ir > x). This condition holds exactly when there are at least n-r+1 intervals with length greater than x.
As a sub-problem, we need to find the probability that exactly j intervals have length > x, and then we can sum over j from n-r+1 to n. The set of n-1 points X1, X2, … X(n-1) can be thought of as a vector uniformly distributed on an (n-1)-cube of unit size. It turns out that the easiest probability to calculate is the probability that a particular group of s intervals (and possibly others, too) have lengths > x. This probability is just (1-sx)n-1, provided s is less than the integer part of 1/x. Otherwise it’s zero.
For example, the probability that the first two intervals starting from the left are greater than .01 in length is (1-2*.01)n-1 Because of the symmetry of the problem, all such groups have the same probability, so the sum over all groups of s intervals of the probability that at least those s intervals have length > x is given by n_C_s (1-sx)n-1, provided s < floor(1/x). You might think we could immediately get the probability that at least n-r+1 intervals have length > x by putting s=n-r+1. But the cases where there are more than s intervals with length > x are being counted multiple times, so we have to use the inclusion-exclusion principle. The probability that exactly j intervals have length > x is
1) [tex]\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex] if s<floor(1/x). Zero otherwise. In other words,
2) [tex]\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex]
Therefore the tail probability distribution for the r-th smallest interval is given by
3) [tex]P(I_r > x) = \sum_{j=n-r+1}^{n}\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1 }[/tex]
To find the expected value of Ir, we integrate this tail probability over the allowable values for x. The smallest x could be is zero. The largest it could be is 1/(n-r+1). This would happen if the smallest r-1 intervals were zero length and all other intervals had length 1/(n-r+1). So we need to integrate P(I_r>x) from 0 to 1/(n-r+1).
Because the upper limit of the sum over s depends on x, we have to break the integral up into small intervals:
4) [tex]E[I_r] = \int_{0}^{\frac{1}{n}}P(I_r > x)dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}P(I_r>x)dx[/tex]
Plugging in (3) for P(I_r>x) we get two main terms for E[I_r].
5) [tex]E[I_r] = \int_{0}^{\frac{1}{n}} \sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx[/tex]
After performing the integrals and rearranging the summations, you get to this:
6) [tex]E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n} \sum_{s=j}^{n}(-1)^{s-j} \frac{ \binom{s}{j} \binom{n}{s} }{s}[/tex]
That inner sum over s had me stumped for over a week, but I finally figured out how it comes out to 1/j. That gives the desired answer
7) [tex]E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n}\frac{1}{j}[/tex]