Broken Stick Model: Use & Expected Length

In summary, the broken stick model is a way of randomly distributing n-1 points on a unit interval, creating n intervals of random lengths. This model is primarily used for studying species abundance in ecology, but also has other applications. To find the expected length of the rth shortest interval, a brute force approach can be used by considering the order statistics for the intervals. More sophisticated derivations for the expectation exist, but can be difficult to follow. The probability that a particular group of intervals have lengths > x can be calculated using the inclusion-exclusion principle, and the expected value of Ir can be found by integrating this probability over the allowable values of x. The final equation for E[Ir] is (1/n) * sum from
  • #1
techmologist
306
12
Anybody ever heard of the broken stick model? If so, what is it used for besides species abundance in ecology? Basically, n-1 points are randomly distributed on a unit interval, creating n intervals of random lengths. The points are independent, uniform random variables on the unit interval.
Here I am going to derive the expected length of Ir, the rth shortest interval using a brute force approach. The I’s are the order statistics for the intervals. That is, 0<=I1<= I2<= … Ir<= … In<= 1. Looking around online, there are some more sophisticated derivations for the expectation that I wasn't able to follow.
First I'll give a brief sketch of the my derivation. I'll fill in the details later.

To begin, we need to find the probability that Ir > x, or P(Ir > x). This condition holds exactly when there are at least n-r+1 intervals with length greater than x.
As a sub-problem, we need to find the probability that exactly j intervals have length > x, and then we can sum over j from n-r+1 to n. The set of n-1 points X1, X2, … X(n-1) can be thought of as a vector uniformly distributed on an (n-1)-cube of unit size. It turns out that the easiest probability to calculate is the probability that a particular group of s intervals (and possibly others, too) have lengths > x. This probability is just (1-sx)n-1, provided s is less than the integer part of 1/x. Otherwise it’s zero.
For example, the probability that the first two intervals starting from the left are greater than .01 in length is (1-2*.01)n-1 Because of the symmetry of the problem, all such groups have the same probability, so the sum over all groups of s intervals of the probability that at least those s intervals have length > x is given by n_C_s (1-sx)n-1, provided s < floor(1/x). You might think we could immediately get the probability that at least n-r+1 intervals have length > x by putting s=n-r+1. But the cases where there are more than s intervals with length > x are being counted multiple times, so we have to use the inclusion-exclusion principle. The probability that exactly j intervals have length > x is

1) [tex]\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex] if s<floor(1/x). Zero otherwise. In other words,
2) [tex]\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex]
Therefore the tail probability distribution for the r-th smallest interval is given by
3) [tex]P(I_r > x) = \sum_{j=n-r+1}^{n}\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1 }[/tex]
To find the expected value of Ir, we integrate this tail probability over the allowable values for x. The smallest x could be is zero. The largest it could be is 1/(n-r+1). This would happen if the smallest r-1 intervals were zero length and all other intervals had length 1/(n-r+1). So we need to integrate P(I_r>x) from 0 to 1/(n-r+1).
Because the upper limit of the sum over s depends on x, we have to break the integral up into small intervals:
4) [tex]E[I_r] = \int_{0}^{\frac{1}{n}}P(I_r > x)dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}P(I_r>x)dx[/tex]
Plugging in (3) for P(I_r>x) we get two main terms for E[I_r].
5) [tex]E[I_r] = \int_{0}^{\frac{1}{n}} \sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx[/tex]
After performing the integrals and rearranging the summations, you get to this:
6) [tex]E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n} \sum_{s=j}^{n}(-1)^{s-j} \frac{ \binom{s}{j} \binom{n}{s} }{s}[/tex]
That inner sum over s had me stumped for over a week, but I finally figured out how it comes out to 1/j. That gives the desired answer
7) [tex]E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n}\frac{1}{j}[/tex]
 
Physics news on Phys.org
  • #2
First, a correction.

techmologist said:
1) [tex]\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex] if s<floor(1/x). Zero otherwise. In other words...

What I should have said is that the term in the sum is [itex](-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/itex] if s<floor(1/x), and zero otherwise. The whole sum is obviously not zero.

Now back to filling in details.

It turns out that the easiest probability to calculate is the probability that a particular group of s intervals (and possibly others, too) have lengths > x. This probability is just (1-sx)n-1, provided s is less than the integer part of 1/x. Otherwise it’s zero.

For example, if n=3, then there are n-1 = 2 points uniformly distributed on the unit interval. The two points, X1 and X2, can satisfy X1 < X2 and X2 < X1 with equal probability, namely 1/2. Considering the case where X1<X2, what is the probability that the two leftmost intervals have length > x? If we take (X1, X2) to be a point uniformly distributed on a unit square, we are now considering the upper left half of that square (a triangle). The condition that the first interval has length > x means X1>x. The condition that the second interval has length > x means X2-X1>x. The area satisfying these two conditions is an equilateral right triangle of side 1-2x, which has area (1-2x)2/2. By symmetry, the case of X2<X1, corresponding to the lower right half of the square, gives the same probability. Adding these two cases together, the overall probability that the first two intervals from the left have length > x is (1-2x)2. Of course this requires that x<1/2. Otherwise, the probability is zero.

Generally, to find the probability that any particular s intervals of the n intervals have length > x, first consider the region of the (n-1)-cube corresponding to X1<X2<...<X(n-1). The volume of the subregion corresponding to s intervals with length >x is (1-sx)n-1/(n-1)!. Returning now to the entire (n-1)-cube, we see that there are (n-1)! regions corresponding to the different possible orderings of X1, X2,...X(n-1). Therefore, the probability that a particular group of s intervals have lengths > x is (1-sx)n-1 provided 1-sx>0. This includes the possibility that other intervals are also > x. But they don't have to be.
 
  • #3
The probability that exactly j intervals have length > x is

1)[tex] \sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex]

This formula comes from the application of the principle of inclusion-exclusion to the problem of finding the probability that exactly j among n possible outcomes are realized. Call the outcomes A1, A2, ..., An. Let ~Ak stand for "outcome j was not realized". Then the different possibilities for exactly j outcomes are mutually exclusive:

A1 & A2 & ... & Aj & ~A(j+1) & ...& ~An
A1 & A2 & ... & A(j-1) & ~Aj & A(j+1) & ~A(j+2) &...&~An
.
.
.
~A1 & ~A2 & ... & ~A(n-j) & A(n-j+1) & ... & An

Consider the first possibility.

P(A1 & A2 & ... & Aj & ~A(j+1) & ...& ~An) = P(A1 & A2 & ... & Aj) - P(A1 & A2 & ... & Aj & at least one of the remaining outcomes is realized also)

We can apply PIE to that second term on the right hand side

P(A1 & A2 & ... & Aj & at least one of the remaining outcomes is realized also) =
8) [tex]\sum_{k=1}^{n-j}(-1)^{k-1} \sum_{j<i_{1} <...<i_{k} \leq n}P(A_{1} \& ... \& A_{j} \& A_{i_1} \& ... \& A_{i_k})[/tex]

and therefore P(A1 & A2 & ... & Aj & ~A(j+1) & ...& ~An) is

9) [tex]\sum_{k=0}^{n-j}(-1)^{k} \sum_{j<i_{1} <...<i_{k} \leq n}P(A_{1} \& ... \& A_{j} \& A_{i_1} \& ... \& A_{i_k})[/tex]

The other possibilities give similar expressions. Since the different possibilities are mutually exclusive, we can add their probabilities to get the probability that exactly j outcomes are realized. When we do this, each particular (j+k)-tuple occurs [itex]\binom{j+k}{j}[/itex] times, each time with sign (-1)^k. Therefore the final result is

P(exactly j of the outcomes A1, ..., An are realized) =

10) [tex]\sum_{k=0}^{n-j}(-1)^{k}\binom{j+k}{j} \sum_{1\leq i_{1} <...<i_{j+k} \leq n}P( A_{i_1} \& ... \& A_{i_{j+k}})[/tex]

In our problem, an outcome Ai is the outcome that the i-th interval from the left has length > x. By the symmetry of the problem, all probabilities of (j+k)-tuples are equal:

11) [tex]P( A_{i_1} \& ... \& A_{i_{j+k}}) = (1-(j+k)x)^{n-1}[/tex]

So the sum over all (j+k)-tuples is just

12) [tex]\sum_{1\leq i_{1} <...<i_{j+k}}P( A_{i_1} \& ... \& A_{i_{j+k}}) = \binom{n}{j+k}(1-(j+k)x)^{n-1}[/tex]

The probability that exactly j intervals have length > x is therefore

13) [tex]\sum_{k=0}^{n-j}(-1)^{k}\binom{j+k}{j}\binom{n}{j+k}(1-(j+k)x)^{n-1}[/tex]

Or, changing the index,

14) [tex]\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex]

Well, that's almost it. As always, 1-sx must be nonnegative to contribute to probability. So we must change the upper limit like this...

15) [tex]\sum_{s=j}^{\min(n, \lfloor \frac{1}{x} \rfloor)}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}[/tex]
 
  • #4
techmologist said:
Anybody ever heard of the broken stick model?

They say "topic sentences" are important. Is yours accurate? Or is your intent to request that someone check your derivation for the distribution of the shortest interval? If, so perhaps you could give a condensed version that incorporates all the corrections.
 
  • #5
Stephen Tashi said:
They say "topic sentences" are important. Is yours accurate? Or is your intent to request that someone check your derivation for the distribution of the shortest interval? If, so perhaps you could give a condensed version that incorporates all the corrections.

I'm not asking anyone to check anything. I am genuinely curious to know if anyone else has encountered this model and how they used it. I'm posting my derivation for my own future reference and for anyone who might benefit from it (an unlikely prospect, maybe).

Of course if someone spots mistakes and wants point them out, that would be great. But I'm not expecting anyone to read through it who isn't interested.
 
  • #6
wrapping up

[tex]E[I_r] = \int_{0}^{\frac{1}{n}} \sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx[/tex]

Going back to equation (5) in my first post, which gives the expected value of the r-th shortest interval...

The first term is easy enough and comes out to

16) [tex]\frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j+1}\frac{1}{s}\binom{s}{j}\binom{n}{s}[(1-\frac{s}{n})^n - 1][/tex]

To simplify the second term, it helps to move the sum over l from the outside to the inside. That way you can take advantage of the telescoping sum.

17) [tex]\sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx = \frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-1}(-1)^{s-j+1}\frac{1}{s}\binom{s}{j}\binom{n}{s}\sum_{l=1}^{n-s}[(1-\frac{s}{n-l})^n - (1-\frac{s}{n-(l-1)})^n][/tex]

which is just

18) [tex]\frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-1}(-1)^{s-j}\frac{1}{s}\binom{s}{j}\binom{n}{s}(1-\frac{s}{n})^n[/tex]

Adding this term to the first term (16) gives

19) [tex]E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\frac{1}{s}\binom{s}{j}\binom{n}{s}[/tex]

I could not think of a counting method for evaluating the inner sum over s, but I thought of something else.

20) [tex]\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}\frac{1}{s} = \sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}\int_{0}^{1}z^{s-1}dz [/tex]

Pulling the integral outside and changing the index gives

21) [tex]\int_{0}^{1}z^{j-1} \sum_{k=0}^{n-j}(-1)^k \binom{j+k}{j} \binom{n}{j+k}z^k dz[/tex]

Note that

[tex]\binom{j+k}{j} \binom{n}{j+k} = \binom{n}{j} \binom{n-j}{k}[/tex]

So our sum is equal to

22) [tex]\binom{n}{j}\int_{0}^{1}z^{j-1} \sum_{k=0}^{n-j}(-1)^k \binom{n-j}{k} z^k dz= \binom{n}{j} \int_{0}^{1}z^{j-1}(1-z)^{n-j}dz[/tex]

Repeated integration by parts shows that this is

23) [tex]\frac{\binom{n}{j}}{n \binom{n-1}{j-1}} = \frac{1}{j}[/tex]
 

What is the Broken Stick Model?

The Broken Stick Model is a statistical model used to predict the expected length of a broken stick. It is based on the assumption that when a stick is broken at a random point, the resulting pieces will be of different lengths.

What is the purpose of using the Broken Stick Model?

The Broken Stick Model is used to estimate the expected length of a broken stick, which can be useful in various fields such as ecology, physics, and finance. It can also be used as a null model in statistical tests to determine if there are any underlying patterns or influences on the expected length of a broken stick.

How is the Broken Stick Model calculated?

The Broken Stick Model is calculated by dividing the length of the stick by a random number generated between 0 and 1. This random number represents the point at which the stick will break, and the resulting pieces will have lengths proportional to this number. This process is repeated multiple times to determine the expected length of the broken stick.

What are the assumptions of the Broken Stick Model?

The Broken Stick Model assumes that the stick is broken at a random point and that there is no relationship between the length of the stick and the location at which it is broken. It also assumes that the stick is broken into two pieces, and that the resulting pieces are independent of each other.

What are the limitations of the Broken Stick Model?

The Broken Stick Model may not accurately predict the expected length of a broken stick in all situations. It assumes that the stick is broken at a random point, which may not be the case in real-life scenarios. Additionally, the model does not take into account external factors that may influence the length of the broken stick, such as the strength or density of the stick.

Similar threads

  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
834
  • Set Theory, Logic, Probability, Statistics
Replies
8
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
623
  • Set Theory, Logic, Probability, Statistics
Replies
4
Views
635
  • Set Theory, Logic, Probability, Statistics
Replies
3
Views
1K
Replies
0
Views
354
  • Set Theory, Logic, Probability, Statistics
Replies
5
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
0
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
1K
Back
Top