# Broken stick model?

## Main Question or Discussion Point

Anybody ever heard of the broken stick model? If so, what is it used for besides species abundance in ecology? Basically, n-1 points are randomly distributed on a unit interval, creating n intervals of random lengths. The points are independent, uniform random variables on the unit interval.
Here I am going to derive the expected length of Ir, the rth shortest interval using a brute force approach. The I’s are the order statistics for the intervals. That is, 0<=I1<= I2<= … Ir<= … In<= 1. Looking around online, there are some more sophisticated derivations for the expectation that I wasn't able to follow.
First I'll give a brief sketch of the my derivation. I'll fill in the details later.

To begin, we need to find the probability that Ir > x, or P(Ir > x). This condition holds exactly when there are at least n-r+1 intervals with length greater than x.
As a sub-problem, we need to find the probability that exactly j intervals have length > x, and then we can sum over j from n-r+1 to n. The set of n-1 points X1, X2, … X(n-1) can be thought of as a vector uniformly distributed on an (n-1)-cube of unit size. It turns out that the easiest probability to calculate is the probability that a particular group of s intervals (and possibly others, too) have lengths > x. This probability is just (1-sx)n-1, provided s is less than the integer part of 1/x. Otherwise it’s zero.
For example, the probability that the first two intervals starting from the left are greater than .01 in length is (1-2*.01)n-1 Because of the symmetry of the problem, all such groups have the same probability, so the sum over all groups of s intervals of the probability that at least those s intervals have length > x is given by n_C_s (1-sx)n-1, provided s < floor(1/x). You might think we could immediately get the probability that at least n-r+1 intervals have length > x by putting s=n-r+1. But the cases where there are more than s intervals with length > x are being counted multiple times, so we have to use the inclusion-exclusion principle. The probability that exactly j intervals have length > x is

1) $$\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}$$ if s<floor(1/x). Zero otherwise. In other words,
2) $$\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}$$
Therefore the tail probability distribution for the r-th smallest interval is given by
3) $$P(I_r > x) = \sum_{j=n-r+1}^{n}\sum_{s=j}^{\min(n,\floor{\frac{1}{x}})}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1 }$$
To find the expected value of Ir, we integrate this tail probability over the allowable values for x. The smallest x could be is zero. The largest it could be is 1/(n-r+1). This would happen if the smallest r-1 intervals were zero length and all other intervals had length 1/(n-r+1). So we need to integrate P(I_r>x) from 0 to 1/(n-r+1).
Because the upper limit of the sum over s depends on x, we have to break the integral up into small intervals:
4) $$E[I_r] = \int_{0}^{\frac{1}{n}}P(I_r > x)dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}P(I_r>x)dx$$
Plugging in (3) for P(I_r>x) we get two main terms for E[I_r].
5) $$E[I_r] = \int_{0}^{\frac{1}{n}} \sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx$$
After performing the integrals and rearranging the summations, you get to this:
6) $$E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n} \sum_{s=j}^{n}(-1)^{s-j} \frac{ \binom{s}{j} \binom{n}{s} }{s}$$
That inner sum over s had me stumped for over a week, but I finally figured out how it comes out to 1/j. That gives the desired answer
7) $$E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n}\frac{1}{j}$$

Related Set Theory, Logic, Probability, Statistics News on Phys.org
First, a correction.

1) $$\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}$$ if s<floor(1/x). Zero otherwise. In other words...
What I should have said is that the term in the sum is $(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}$ if s<floor(1/x), and zero otherwise. The whole sum is obviously not zero.

Now back to filling in details.

It turns out that the easiest probability to calculate is the probability that a particular group of s intervals (and possibly others, too) have lengths > x. This probability is just (1-sx)n-1, provided s is less than the integer part of 1/x. Otherwise it’s zero.
For example, if n=3, then there are n-1 = 2 points uniformly distributed on the unit interval. The two points, X1 and X2, can satisfy X1 < X2 and X2 < X1 with equal probability, namely 1/2. Considering the case where X1<X2, what is the probability that the two leftmost intervals have length > x? If we take (X1, X2) to be a point uniformly distributed on a unit square, we are now considering the upper left half of that square (a triangle). The condition that the first interval has length > x means X1>x. The condition that the second interval has length > x means X2-X1>x. The area satisfying these two conditions is an equilateral right triangle of side 1-2x, which has area (1-2x)2/2. By symmetry, the case of X2<X1, corresponding to the lower right half of the square, gives the same probability. Adding these two cases together, the overall probability that the first two intervals from the left have length > x is (1-2x)2. Of course this requires that x<1/2. Otherwise, the probability is zero.

Generally, to find the probability that any particular s intervals of the n intervals have length > x, first consider the region of the (n-1)-cube corresponding to X1<X2<...<X(n-1). The volume of the subregion corresponding to s intervals with length >x is (1-sx)n-1/(n-1)!. Returning now to the entire (n-1)-cube, we see that there are (n-1)! regions corresponding to the different possible orderings of X1, X2,....X(n-1). Therefore, the probability that a particular group of s intervals have lengths > x is (1-sx)n-1 provided 1-sx>0. This includes the possibility that other intervals are also > x. But they don't have to be.

The probability that exactly j intervals have length > x is

1)$$\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}$$
This formula comes from the application of the principle of inclusion-exclusion to the problem of finding the probability that exactly j among n possible outcomes are realized. Call the outcomes A1, A2, ..., An. Let ~Ak stand for "outcome j was not realized". Then the different possibilities for exactly j outcomes are mutually exclusive:

A1 & A2 & ... & Aj & ~A(j+1) & ...& ~An
A1 & A2 & ... & A(j-1) & ~Aj & A(j+1) & ~A(j+2) &...&~An
.
.
.
~A1 & ~A2 & ... & ~A(n-j) & A(n-j+1) & ... & An

Consider the first possibility.

P(A1 & A2 & ... & Aj & ~A(j+1) & ...& ~An) = P(A1 & A2 & ... & Aj) - P(A1 & A2 & ... & Aj & at least one of the remaining outcomes is realized also)

We can apply PIE to that second term on the right hand side

P(A1 & A2 & ... & Aj & at least one of the remaining outcomes is realized also) =
8) $$\sum_{k=1}^{n-j}(-1)^{k-1} \sum_{j<i_{1} <...<i_{k} \leq n}P(A_{1} \& ... \& A_{j} \& A_{i_1} \& .... \& A_{i_k})$$

and therefore P(A1 & A2 & ... & Aj & ~A(j+1) & ...& ~An) is

9) $$\sum_{k=0}^{n-j}(-1)^{k} \sum_{j<i_{1} <...<i_{k} \leq n}P(A_{1} \& ... \& A_{j} \& A_{i_1} \& .... \& A_{i_k})$$

The other possibilities give similar expressions. Since the different possibilities are mutually exclusive, we can add their probabilities to get the probability that exactly j outcomes are realized. When we do this, each particular (j+k)-tuple occurs $\binom{j+k}{j}$ times, each time with sign (-1)^k. Therefore the final result is

P(exactly j of the outcomes A1, ..., An are realized) =

10) $$\sum_{k=0}^{n-j}(-1)^{k}\binom{j+k}{j} \sum_{1\leq i_{1} <...<i_{j+k} \leq n}P( A_{i_1} \& .... \& A_{i_{j+k}})$$

In our problem, an outcome Ai is the outcome that the i-th interval from the left has length > x. By the symmetry of the problem, all probabilities of (j+k)-tuples are equal:

11) $$P( A_{i_1} \& .... \& A_{i_{j+k}}) = (1-(j+k)x)^{n-1}$$

So the sum over all (j+k)-tuples is just

12) $$\sum_{1\leq i_{1} <...<i_{j+k}}P( A_{i_1} \& .... \& A_{i_{j+k}}) = \binom{n}{j+k}(1-(j+k)x)^{n-1}$$

The probability that exactly j intervals have length > x is therefore

13) $$\sum_{k=0}^{n-j}(-1)^{k}\binom{j+k}{j}\binom{n}{j+k}(1-(j+k)x)^{n-1}$$

Or, changing the index,

14) $$\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}$$

Well, that's almost it. As always, 1-sx must be nonnegative to contribute to probability. So we must change the upper limit like this...

15) $$\sum_{s=j}^{\min(n, \lfloor \frac{1}{x} \rfloor)}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1}$$

Stephen Tashi
Anybody ever heard of the broken stick model?
They say "topic sentences" are important. Is yours accurate? Or is your intent to request that someone check your derivation for the distribution of the shortest interval? If, so perhaps you could give a condensed version that incorporates all the corrections.

They say "topic sentences" are important. Is yours accurate? Or is your intent to request that someone check your derivation for the distribution of the shortest interval? If, so perhaps you could give a condensed version that incorporates all the corrections.
I'm not asking anyone to check anything. I am genuinely curious to know if anyone else has encountered this model and how they used it. I'm posting my derivation for my own future reference and for anyone who might benefit from it (an unlikely prospect, maybe).

Of course if someone spots mistakes and wants point them out, that would be great. But I'm not expecting anyone to read through it who isn't interested.

wrapping up

$$E[I_r] = \int_{0}^{\frac{1}{n}} \sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx + \sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx$$

Going back to equation (5) in my first post, which gives the expected value of the r-th shortest interval...

The first term is easy enough and comes out to

16) $$\frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j+1}\frac{1}{s}\binom{s}{j}\binom{n}{s}[(1-\frac{s}{n})^n - 1]$$

To simplify the second term, it helps to move the sum over l from the outside to the inside. That way you can take advantage of the telescoping sum.

17) $$\sum_{l=1}^{r-1} \int_{\frac{1}{n-(l-1)}}^{\frac{1}{n-l}}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-l}(-1)^{s-j}\binom{s}{j}\binom{n}{s}(1-sx)^{n-1} dx = \frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-1}(-1)^{s-j+1}\frac{1}{s}\binom{s}{j}\binom{n}{s}\sum_{l=1}^{n-s}[(1-\frac{s}{n-l})^n - (1-\frac{s}{n-(l-1)})^n]$$

which is just

18) $$\frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n-1}(-1)^{s-j}\frac{1}{s}\binom{s}{j}\binom{n}{s}(1-\frac{s}{n})^n$$

Adding this term to the first term (16) gives

19) $$E[I_r] = \frac{1}{n}\sum_{j=n-r+1}^{n}\sum_{s=j}^{n}(-1)^{s-j}\frac{1}{s}\binom{s}{j}\binom{n}{s}$$

I could not think of a counting method for evaluating the inner sum over s, but I thought of something else.

20) $$\sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}\frac{1}{s} = \sum_{s=j}^{n}(-1)^{s-j}\binom{s}{j}\binom{n}{s}\int_{0}^{1}z^{s-1}dz$$

Pulling the integral outside and changing the index gives

21) $$\int_{0}^{1}z^{j-1} \sum_{k=0}^{n-j}(-1)^k \binom{j+k}{j} \binom{n}{j+k}z^k dz$$

Note that

$$\binom{j+k}{j} \binom{n}{j+k} = \binom{n}{j} \binom{n-j}{k}$$

So our sum is equal to

22) $$\binom{n}{j}\int_{0}^{1}z^{j-1} \sum_{k=0}^{n-j}(-1)^k \binom{n-j}{k} z^k dz= \binom{n}{j} \int_{0}^{1}z^{j-1}(1-z)^{n-j}dz$$

Repeated integration by parts shows that this is

23) $$\frac{\binom{n}{j}}{n \binom{n-1}{j-1}} = \frac{1}{j}$$