Order Statistics: CDF

  • I
  • Thread starter EngWiPy
  • Start date
  • #1
1,367
61

Main Question or Discussion Point

Suppose I have the random variables ##Z_k=X_k/Y_k## with a PDF ##f_{Z_k}(z_k)## for ##k=1,\,2\,\ldots, K##, where ##\{X_k, Y_k\}## are i.i.d. random variables. I can find

[tex]\text{Pr}\left[\sum_{i=1}^3Z_k\leq \eta\right][/tex]

as

[tex]\int_{z_3=0}^{\eta}\int_{z_2=0}^{\eta-z_3}\int_{z_1=0}^{\eta-z_3-z_2}f_{Z_1}(z_1)f_{Z_2}(z_2)f_{Z_3}(z_3)\,dz_1dz_2dz_3[/tex]

Now suppose I arrange the random variables ##\{X_k\}_{k=1}^K## as

[tex]X_{(1)}\leq X_{(2)}\leq \cdots \leq X_{(K)}[/tex]

and define ##G_k=X_{(k)}/Y_{i_k}##, where ##i_k## is the index of the ##k^{\text{th}}## order statistic.

How can I find

[tex]\text{Pr}\left[\sum_{i=1}^3G_k\leq \eta\right][/tex]
 

Answers and Replies

  • #2
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
where ##\{X_k, Y_k\}## are i.i.d. random variables
This doesn't make it clear what is identically distributed to what, or what is independent of what.

The maximal interpretation is that all ##X_k## and all ##Y_k## have the same distribution, with PDF ##f_X##, and that the set of random variables ##\{X_1,...,X_K,Y_1,...,Y_K\}## is pairwise independent. Or it could mean that ##X_k## has the same distribution as ##Y_k## and the two are independent of each other, but the distributions may differ from that of ##X_j## for ##j\neq k##, and there may be dependences between ##X_k## and ##X_j##. There are a variety of possible interpretations.

I will assume that the set of all ##2K## random variables is pairwise independent, and that all ##X_k## have the same PDF ##f_X## and all the ##Y_k## have the same PDF ##f_Y##, but the two PDFs may differ from one another. I will also assume that the support of all the random variables is the positive reals, which is implied by the integral you have written.

With those assumptions, the distribution of ##X_{(k)}/Y_{i_k}## is the same as that of ##X_{(k)}/Y_{j}## for any ##j##. Then the CDF of ##G_k## satisfies:

$$ F_{G_k}(a) = \int_0^\infty\int_0^{\eta a}f_{X_{(k)}}(x) \,dx\,f_Y(y)\,dy$$

The PDF ##f_{X_{(k)}}## of the ##k##th order statistic of ##X## can be calculated using the formula here.

The PDF ##f_{G_k}## of ##G_k## is obtained by differentiating the above expression.
Having obtained that, we calculate the CDF of ##G_1+G_2+G_3## using the triple integral you wrote above, but replacing each ##f_{Z_k}## by ##f_{G_k}##.
 
  • #3
1,367
61
What I meant by ##\{X_k, Y_k\}## of being i.i.d. is that every single random variable is independent of all other random variables, and all have the same PDF distribution. I could have written that the random variables ##\{X_k\}_{k=1}^{2K}## are i.i.d., but because I need to separate the nominator and denominator, I used different symbols. These random variables are exponential random variables, so yes, the support is positive real. I wasn't clear about that. My mistake.

....
$$ F_{G_k}(a) = \int_0^\infty\int_0^{\eta a}f_{X_{(k)}}(x) \,dx\,f_Y(y)\,dy$$

The PDF ##f_{X_{(k)}}## of the ##k##th order statistic of ##X## can be calculated using the formula here.

The PDF ##f_{G_k}## of ##G_k## is obtained by differentiating the above expression.
Having obtained that, we calculate the CDF of ##G_1+G_2+G_3## using the triple integral you wrote above, but replacing each ##f_{Z_k}## by ##f_{G_k}##.

OK, so we can average over each ##G_k## as if the random variables are independent with no ordering, but what about the limits? I mean, ##G_3=\frac{X_{(3)}}{Y_3}## has the limits ##0## to ##\eta##, but ##G_2## contains ##X_{(2)}## which is less than or equal ##X_{(3)}## in ##G_3##. Or this doesn't affect the evaluation?
 
  • #4
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
we can average over each ##G_k## as if the random variables are independent with no ordering, but what about the limits? I mean, ##G_3=\frac{X_{(3)}}{Y_3}## has the limits ##0## to ##\eta##, but ##G_2## contains ##X_{(2)}## which is less than or equal ##X_{(3)}## in ##G_3##. Or this doesn't affect the evaluation?
You are right to point out that the new ##G_k##s are not mutually independent, because they are based on order statistics. So we have to change the integrand in your triple integral, because it is based on the Zs being independent.

Instead we'll have to do (take a deep breath!):
\begin{align*}
\text{Pr}&\left[\sum_{i=1}^3G_k\leq \eta\right]
=
\text{Pr}\left[\frac{X_{(1)}}{Y_1}+\frac{X_{(2)}}{Y_2}+\frac{X_{(3)}}{Y_3}\leq \eta\right]
\\
&=
\int_{y_1=0}^\infty \int_{x_1=0}^{y_1\eta}
\int_{y_2=0}^\infty \int_{x_2=0}^{y_2(\eta - x_1/y_1)}
\int_{y_3=0}^\infty \int_{x_3=0}^{y_3(\eta - x_1/y_1-x_2/y_2)}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
f_Y(y_1) f_Y(y_2) f_Y(y_3)
f_{X_{(1)},X_{(2)},X_{(3)}}(x_1,x_2,x_3)
\,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
\\
&=
\int_{y_1=0}^\infty f_Y(y_1) \int_{x_1=0}^{y_1\eta}
\int_{y_2=0}^\infty f_Y(y_2) \int_{x_2=0}^{y_2(\eta - x_1/y_1)}
\int_{y_3=0}^\infty f_Y(y_3) \int_{x_3=0}^{y_3(\eta - x_1/y_1-x_2/y_2)}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
f_{X_{(1)},X_{(2)},X_{(3)}}(x_1,x_2,x_3)
\,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
\end{align*}

It looks horrible, but it should simplify down if all your distributions are exponentials.

You can get the three-way joint probability function for the Xs from the wiki section I linked above. The formula is on the last line in the section, with ##n=3##.

EDIT: Actually the formulas in that section don't give exactly what we want, which is the joint density for the three smallest order stats, but we can infer a formula using the same principles. It should be:

$$\frac{K!}{(K-3)!}f_X(x_1)f_X(x_2)f_X(x_3) \left[1-F_X(x_3)\right]^{K-3}$$

where ##x_1\le x_2\le x_3##, otherwise zero. To enforce that inequality condition we need to change the limits for the X integrals, to get:
\begin{align*}
\text{Pr}&\left[\sum_{i=1}^3G_k\leq \eta\right]
=
\int_{y_1=0}^\infty \int_{x_1=0}^{y_1\eta}
\int_{y_2=0}^\infty \int_{x_2=x_1}^{\max(x_1,y_2(\eta - x_1/y_1))}
\int_{y_3=0}^\infty \int_{x_3=x_2}^{\max(x_2,y_3(\eta - x_1/y_1-x_2/y_2))}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\frac{K!}{(K-3)!}
f_Y(y_1)f_Y(y_2)f_Y(y_3)
f_X(x_1)f_X(x_2)f_X(x_3) \left[1-F_X(x_3)\right]^{K-3}
\,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
\\&=
\frac{K!}{(K-3)!}
\int_{y_1=0}^\infty f_Y(y_1) \int_{x_1=0}^{y_1\eta} f_X(x_1)
\int_{y_2=0}^\infty f_Y(y_2) \int_{x_2=x_1}^{\max(x_1,y_2(\eta - x_1/y_1))} f_X(x_2)
\int_{y_3=0}^\infty f_Y(y_3) \int_{x_3=x_2}^{\max(x_2,y_3(\eta - x_1/y_1-x_2/y_2))}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
f_X(x_3) \left[1-F_X(x_3)\right]^{K-3}
\,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
\end{align*}
 
Last edited:
  • #5
1,367
61
Thanks. As expected. The result is nasty. Could you explain how you arrived to the integrals ( and their limits) in a little more detail?

Another but related question, can this CDF be evaluated numerically using any software tool for any value ##1\leq M\leq K## instead of 3?
 
  • #6
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
Thanks. As expected. The result is nasty. Could you explain how you arrived to the integrals ( and their limits) in a little more detail?
If all the distributions are exponentials, I think there's a reasonable chance you could evaluate this analytically, starting from the inside and working out. It's possible the max-es in the upper limits may throw a spanner in the works, but it's worth a try. If not then a good mathematical software tool should be able to numerically integrate it. If it complains about the six-deep nesting, it shouldn't be too hard to write a little program to do it, in something like Matlab, Python or R, assuming the exponential parameter and the value of ##\eta## are both known.

If we change 3 to some other M in 1...K, the same principles apply, but instead of six-deep nesting of integrals it will be 2M-deep! Maybe there's some formula that can reduce the number of integrals, but I can't see one from here.

Setting ##x_0=0## to make the pattern of limits clearer, the formula would be:

\begin{align*}
\text{Pr}&\left[\sum_{i=1}^M G_k\leq \eta\right]
=
\int_{y_1=0}^\infty \int_{x_1=x_0}^{\max\left(x_0,\,y_1\left(\eta - \sum_{k=1}^0 x_k/y_k\right)\right)}
\int_{y_2=0}^\infty \int_{x_2=x_1}^{\max\left(x_1,\,y_2\left(\eta - \sum_{k=1}^1 x_k/y_k\right)\right)}
\\
&\quad\quad\quad\quad\quad\quad\quad\quad\quad...
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad
\int_{y_M=0}^\infty \int_{x_M=x_{M-1}}^{\max\left(x_{M-1},\,y_M\left(\eta - \sum_{k=1}^{M-1} x_k/y_k\right)\right)}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\frac{K!}{(K-M)!}
f_Y(y_1)f_Y(y_2)...f_Y(y_M)
f_X(x_1)f_X(x_2)...f_X(x_M) \left[1-F_X(x_M)\right]^{K-M}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\,dx_M\,dy_M...\,dx_2\,dy_2\,dx_1\,dy_1
%
\\&=
\frac{K!}{(K-M)!}
\int_{y_1=0}^\infty f_Y(y_1)
\int_{x_1=0}^{\max\left(x_0,\,y_1\left(\eta - \sum_{k=1}^0 x_k/y_k\right)\right)} f_X(x_1)
\int_{y_2=0}^\infty f_Y(y_2)
\int_{x_2=x_1}^{\max\left(x_1,\,y_2\left(\eta - \sum_{k=1}^1 x_k/y_k\right)\right)} f_X(x_2)
\\
&\quad\quad\quad\quad\quad\quad\quad\quad\quad...
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad
\int_{y_M=0}^\infty f_Y(y_M)
\int_{x_M=x_{M-1}}^{\max\left(x_{M-1},\,y_M\left(\eta - \sum_{k=1}^{M-1} x_k/y_k\right)\right)} f_X(x_M)
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\left[1-F_X(x_M)\right]^{K-M}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\,dx_M\,dy_M...\,dx_2\,dy_2\,dx_1\,dy_1
\end{align*}

The complexity of the integral is nearly all in the limits. In the centre is just the joint PDF of all six variables. That is the product of three PDFs for the three independent Y random variable and the joint PDF for the three Xs. The latter is converted to a product of three independent X PDFs, with a combinatoric coefficient because of the different possible orderings, and the factor ##\left[1-F_X(x_M)\right]^{K-M}## that ensures that all the other ##M-K## X values are greater than ##X_{(M)}##.

The integration limits are set up from the outside in. The range of the Y variables is unconstrained. But each X variable must have an upper limit to ensure that it doesn't cause the sum of all the Gs covered by the integral signs thus far to exceed ##\eta##. Also, each X has to have a lower limit of the previous X value, to protect the required ordering of the order statistics, and that requires an adjustment to the upper limit as well, so that if the lower limit is greater than the upper limit, we get zero rather than a negative number from a backwards integral, which is a possible interpretation when the lower integral limit exceeds the upper one.
 
  • #7
1,367
61
That makes sense. The complexity of the mathematical expression is not an issue. Eventually, I need to evaluate the CDF. It doesn't matter how (except not using Monte-Carlo simulation, because I use it for verification). Of course, if a numerical solution is possible, that is great, because it would be easier to evaluate, but numerical integration is just fine. The analytical solution doesn't have a pattern in the case of no ordering, so I suspect it would have with ordering. I will try this again anyway.

Could you give me a hint on how to write a program in either MATLAB or Python to evaluate these ##2M## nested integrals? In MATLAB I think you cannot go for more than 2 or maybe three nested integrals. I need a code that can be used for any value of M.
 
  • #8
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
For the general case of M rather than 3, the depth of nesting is unknown, so it's easiest to use recursion.
I wouldn't normally do this but, since I'm learning Python, I wrote the following Python code for practice. It is untested but should do the trick if you enter the necessary PDFs for the X and Y distributions in the function skeletons below.

The integration is bog-basic rectangular integration. If you want you can sex it up to trapezoidal or even Simpson integration for a more accurate result. But unless it takes a long time to run, it's probably easier to keep it rectangular and just increase the granularity.

Code:
granularity = 1000
lim_y = 100
M = 10 # order statistic
K = 30 # number of Xs (also the number of Ys)

def pdf_y(y)
  # define pdf function for y here

def pdf_x(x)
  # define pdf function for x here

def cdf_x(x)
  # define cdf function for x here

def permut(n, k)
  # calc number of permutations of k chosen from n
  perm = n
  j = 1
  while j < k  
    perm *= n - j
    j += 1
  return perm
# end of permutation function

def integ2(eta, x_prev, sum_x, m, k)
  y = 0
  integ_y = 0
  # integrate across this y_(M-m)
  while y < lim_y
    y += dy
    lim_x = max(x_prev, y * (eta - sum_x))
    dx = (limx - x_prev) / granularity
    x = x_prev
    integ_x = 0
    # integrate across this x_(M-m)
    while x < lim_x
       x += dx
       if m > 0
         integrand = integ2(eta, x, sum_x + x / y, m - 1, k - 1))
       else
         integrand = (1 - cdf_x(x)) ^ k
       integ_x += integrand * pdf_x(x) * dx
    integ_y += integ_x * pdf_y(y) * dy
  return integ_y
# end of definition of function integ2

dy = lim_y / granularity
# call the function to find the probability
prob = permut(K, M) * integ2(eta, 0, 0, M - 1, K - 1)
print(prob)
 
  • #9
1,367
61
I will need to go over the math and the code in more details, and compare the results with those of Monte-Carlo simulations at the end. I will report back. Thanks for your responses. They were very helpful.
 
  • #10
1,367
61
@andrewkirk Apparently, for ##M\geq 3## the CDF is very complicated to be evaluated numerically, and simulations has to be done. But simulations isn't an efficient way to evaluate nested interdependent integrals. So, I am trying to evaluate a closed-form solution for the special case of ##M=2##. In particular, I want to evaluate the following CDF (note that I took another way of evaluating the integrals, only to make the approach consistent of some other work I am doing. I think both are essentially the same.)

[tex]F_Z(z) = \text{Pr}\left[\frac{X_{(1)}}{Y_1}+\frac{X_{(2)}}{Y_2}\leq z\right] = \int_{y_2}^{\infty}\int_{x_2=0}^{y_2z}\int_{y_2=0}^{\infty}\int_{x_1=0}^{\min\left(x_2,\,y_1(z-\frac{x_2}{y_2})\right)}f_{(1).(2)}(x_1,x_2)f_Y(y_1)f_Y(y_2)\,dx_1\,dx_2\,dy_1\,dy_2[/tex]

where ##f_{(1),(2)}(x_1,x_2)## is the joint PDF of the order statistics ##X_{(1)}## and ##X_{(2)}##, and ##f_Y(y_k)## is the PDF of the random variable ##Y_k##.

The joint PDF ##f_{(1),(2)}(x_1,x_2)## can be written as (all random variables are i.i.d. exponential RVs with parameter 1)

[tex]f_{(1),(2)}(x_1,x_2) = K(K-1)e^{-x_1}e^{-(K-1)x_2}[/tex]

Thus the original CDF can be written as

[tex]F_Z(z) = K(K-1)\int_{y_2}^{\infty}\int_{x_2=0}^{y_2z}\int_{y_2=0}^{\infty}\int_{x_1=0}^{\min\left(x_2,\,y_1(z-\frac{x_2}{y_2})\right)}e^{-x_1}e^{-(K-1)x_2}e^{-y_1}e^{-y_2}\,dx_1\,dx_2\,dy_1\,dy_2[/tex]

The inner most integral can be evaluate as

[tex]\int_{x_1=0}^{\min\left(x_2,\,y_1(z-\frac{x_2}{y_2})\right)}e^{-x_1}\,dx_1 = 1-\exp\left(-\min\left(x_2,\,y_1(z-\frac{x_2}{y_2})\right)\right)[/tex]

So

[tex]F_Z(z) = K(K-1)\left[I_1(z) - I_2(z)\right][/tex]

##I_1(z)## can be evaluated easily, but I got stuck in evaluating ##I_2## which is

[tex]I_2(z)=\int_{y_2}^{\infty}\int_{x_2=0}^{y_2z}\int_{y_1=0}^{\infty}\exp\left(-\min\left(x_2,\,y_1(z-\frac{x_2}{y_2})\right)\right)e^{-(K-1)x_2}e^{-y_1}e^{-y_2}\,dx_2\,dy_1\,dy_2[/tex]

I divided the integrals as follows

[tex]I_2(z) = \int_{y_2}^{\infty}\int_{y_1}^{\infty}\int_{x_2=0}^{\frac{zy_1y_2}{y_1+y_2}}e^{-Kx_2}e^{-y_1}e^{-y_2}\,dx_2\,dy_1\,dy_2+ \int_{y_2}^{\infty}\int_{y_1}^{\infty}\int_{x_2=\frac{zy_1y_2}{y_1+y_2}}^{y_2z}e^{-y_1(z-\frac{x_2}{y_2}+1)}e^{-(K-1)x_2}e^{-y_2}\,dx_2\,dy_1\,dy_2[/tex]

The limit ##\frac{zy_1y_2}{y_1+y_2}## makes the outer integrals difficult to evaluate. Is there another way or any tricks I can do to evaluate the original CDF in closed-form?
 
Last edited:
  • #11
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
I am not optimistic of being able to get a nice analytic formula (closed-form), because of the complicated, inter-related limits. But I think we can get a reasonable algorithm for numerical integration, using the same technique as we discussed recently for doing nested integrals by iteration from the inside out.

It is based on the formulas I wrote above:

\begin{align*}
\text{Pr}&\left[\sum_{i=1}^M G_k\leq \eta\right]
=
\int_{y_1=0}^\infty \int_{x_1=x_0}^{\max\left(x_0,\,y_1\left(\eta - \sum_{k=1}^0 x_k/y_k\right)\right)}
\int_{y_2=0}^\infty \int_{x_2=x_1}^{\max\left(x_1,\,y_2\left(\eta - \sum_{k=1}^1 x_k/y_k\right)\right)}
\\
&\quad\quad\quad\quad\quad\quad\quad\quad\quad...
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad
\int_{y_M=0}^\infty \int_{x_M=x_{M-1}}^{\max\left(x_{M-1},\,y_M\left(\eta - \sum_{k=1}^{M-1} x_k/y_k\right)\right)}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\frac{K!}{(K-M)!}
f_Y(y_1)f_Y(y_2)...f_Y(y_M)
f_X(x_1)f_X(x_2)...f_X(x_M) \left[1-F_X(x_M)\right]^{K-M}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\,dx_M\,dy_M...\,dx_2\,dy_2\,dx_1\,dy_1

\\&=
\frac{K!}{(K-M)!}
\int_{y_1=0}^\infty f_Y(y_1)
\int_{x_1=0}^{\max\left(x_0,\,y_1\left(\eta - \sum_{k=1}^0 x_k/y_k\right)\right)} f_X(x_1)
\int_{y_2=0}^\infty f_Y(y_2)
\int_{x_2=x_1}^{\max\left(x_1,\,y_2\left(\eta - \sum_{k=1}^1 x_k/y_k\right)\right)} f_X(x_2)
\\
&\quad\quad\quad\quad\quad\quad\quad\quad\quad...
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad
\int_{y_M=0}^\infty f_Y(y_M)
\int_{x_M=x_{M-1}}^{\max\left(x_{M-1},\,y_M\left(\eta - \sum_{k=1}^{M-1} x_k/y_k\right)\right)} f_X(x_M)
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\left[1-F_X(x_M)\right]^{K-M}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\,dx_M\,dy_M...\,dx_2\,dy_2\,dx_1\,dy_1
\end{align*}

For ##k=1,...,M## define functions
##I_k^M:\mathbb R^2\to \mathbb R## and ##J_k^M:\mathbb R^3\to \mathbb R## such that:

##I_k^M(x_{k-1},s_{k-1}) = \int_0^\infty f_Y(y_k) J_k^M(x_{k-1},s_{k-1}, y_k)\,dy_k##

##J_k^M(x_{k-1},s_{k-1}, y_k) =
\begin{cases}
0&\mbox{, if }x_{k-1} \ge y_ks_{k-1}\textrm{, else}\\
\int_{x_{k-1}}^{y_ks_{k-1}} f_X(x_k)\left(1-F_X(x_k)\right)^{K-M}\,dx_k
&\mbox{, if } k=M \textrm{, else}\\
\int_{x_{k-1}}^{y_ks_{k-1}} f_X(x_k)
I_{k+1}^M(x_k, s_{k-1} - x_k/y_k)
\,dx_k
&\mbox{, otherwise}
\end{cases} ##

Note that ##s_k=s_0 - \sum_{j=1}^k x_k/y_k##.

Then by suitably constraining and segmenting the range of possible values for ##x, y, s## so that each range contains say ##N## points for ##x,y## and ##L>N## points for ##s##, we can calculate and store these functions as ##N\times L##-shaped tables for the ##I_k^M## and ##N\times L\times N##-shaped tables for the ##J_k^M##. We start with ##J_M^M## and then work outwards, at each stage using the table just determined to do the integration for the new function. For accuracy we should use Simpson's Rule, with an extra trapezoid at the end when the number of segments in the integration region is even.

This approach will take longer to run than the one considered earlier, having the order of ##N^3L## iterations. But for any given value of ##K## and ##M##, it only needs to be run once, and only the table for ##I_1^M## needs to be retained. Then for any ##\eta\in\mathbb R^+## we can calculate:

$$\text{Pr}\left[\sum_{i=1}^M G_i\leq \eta\right]
= \frac{K!}{(K-M)!}I_1^M(0, \eta)$$

This can be programmed in a language like Python, following the same principles as the earlier example. It will be somewhat more complex because of the two extra dimensions, but eminently doable. I suggest starting with a very coarse grid until it is fully debugged, then running it overnight on a finer grid.
 
Last edited:
  • #12
1,367
61
Thanks, but I still don't get the implementation of the example you did in the other thread in R and Python without selection/ordering. I tried to start there because it is an easier case, and then come back here. But since I didn't understand the implementation there, I thought may be I can do some analytical solution for the special case ##M=2##. I found an analytic solution when the ordering includes the whole fraction ##X_k/Y_k##. In this case, I just have to consider the whole fraction as a new RV, and thus will have double integral. But when ordering is done on one of the RVs ##X_k## or ##Y_k##, it becomes much more complicated, and you need to consider the individual RVs.

Without ordering there was ##M-1## nested integrals (which would correspond to selecting ##M## out of ##K## random variables at random, and ##-1## because for the last integral we used the CDF), now there would be ##2M## nested integrals (here we need to consider the joint PDF, and we cannot use the CDFs). I get the math and how to get the limits, but I don't understand the implementation. I do for the recursion, though but it is not efficient.

So, you think even for ##M=2## there is no analytic solution? In that case, doing a numerical integration for ##M=2## is not as a contribution as for a general ##M##. Probably I should read your other posts again.

Thanks for your responses.

PS: Just a side note, wouldn't it be easier if you consider the larger RVs first? In that case the lower limits for the integrals would be 0, and the upper limits would be min instead of max. It probably would make a difference for an analytic solution, but maybe not for a numerical solution. Just a remark.
 
Last edited:
  • #13
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
Approaching it analytically with ##M=2## I can get one step further than was reached above, to where there are two integrals left to evaluate. But at that point it looks about to become horribly messy, because the next integration variable ##x_1## appears in both the numerator and denominator of exponents, as well as in the denominator of the integrand. Wolfram doesn't produce anything that looks easy to work with. FWIW, here's my development:

For ##M=2## and ##X,Y## both exponential with parameter 1, we have:
%
\begin{align*}
\text{Pr}&\left[\sum_{i=1}^M G_k\leq \eta\right]
=
\frac{K!}{(K-2)!}
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\int_{y_2=0}^\infty e^{-y_2}
\int_{x_2=x_{1}}^{\max\left(x_{1},\,y_2\left(\eta - x_1/y_1\right)\right)} e^{-x_2}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
\left[1-\left(1-e^{-x_2}\right)\right]^{K-2}
\,dx_2\,dy_2\,dx_1\,dy_1 \\
% 2
&=
K(K-1)
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\int_{y_2=0}^\infty e^{-y_2}
\int_{x_2=x_{1}}^{\max\left(x_{1},\,y_2\left(\eta - x_1/y_1\right)\right)} e^{-x_2}
e^{-x_2(K-2)}
\,dx_2\,dy_2\,dx_1\,dy_1 \\
% 3
&=
K(K-1)
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\int_{y_2=0}^\infty e^{-y_2}
\int_{x_2=x_{1}}^{\max\left(x_{1},\,y_2\left(\eta - x_1/y_1\right)\right)}
e^{-x_2(K-1)}
\,dx_2\,dy_2\,dx_1\,dy_1
\\% 4
&=
\frac{K(K-1)}{K-1}
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\int_{y_2=0}^\infty e^{-y_2}
\left[
-e^{-x_2(K-1)}
\right]_{x_2=x_{1}}^{\max\left(x_{1},\,y_2\left(\eta - x_1/y_1\right)\right)}
\,dy_2\,dx_1\,dy_1
%
\\% 5
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\int_{y_2=0}^\infty e^{-y_2}
\left(
e^{-x_1(K-1)}
-
e^{-\left(\max\left(x_{1},\,y_2\left(\eta - x_1/y_1\right)\right)\right)(K-1)}
\right)
\,dy_2\,dx_1\,dy_1
%
\\% 6
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\left(
e^{-x_1(K-1)}
-
\int_{y_2=0}^\infty e^{-y_2}
e^{-\left(\max\left(x_{1},\,y_2\left(\eta - x_1/y_1\right)\right)\right)(K-1)}
\,dy_2 \right)
\,dx_1\,dy_1
%
\\% 7
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\bigg(
e^{-x_1(K-1)}
- \int_{y_2=0}^{\frac{x_1}{\eta - x_1/y_1}} e^{-y_2}
e^{-x_1(K-1)} \,dy_2
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
- \int_{y_2={\frac{x_1}{\eta - x_1/y_1}}}^\infty e^{-y_2}
e^{-y_2\left(\eta - x_1/y_1\right)(K-1)} \,dy_2
\bigg)
\,dx_1\,dy_1
%
\\% 8
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\bigg(
e^{-x_1(K-1)}
+
e^{-x_1(K-1)}\bigg[e^{-y_2}\bigg]_{y_2=0}^{\frac{x_1}{\eta - x_1/y_1}}
\\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
- \int_{y_2={\frac{x_1}{\eta - x_1/y_1}}}^\infty
e^{-y_2\left( 1 + \left(\eta - x_1/y_1\right)(K-1)\right)} \,dy_2
\bigg)
\,dx_1\,dy_1
%
\\% 9
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\bigg(
e^{-\frac{x_1}{\eta - x_1/y_1}(K-1)}
+ \frac{
\left[
e^{-y_2\left( 1 + \left(\eta - x_1/y_1\right)(K-1)\right)}
\right]_{y_2={\frac{x_1}{\eta - x_1/y_1}}}^\infty
}{1 + \left(\eta - x_1/y_1\right)(K-1)}
\bigg)
\,dx_1\,dy_1
%
\\% 10
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta} e^{-x_1}
\bigg(
e^{-\frac{x_1}{\eta - x_1/y_1}(K-1)}
- \frac{
e^{-{{{\frac{x_1}{\eta - x_1/y_1}}} }\left( 1 + \left(\eta - x_1/y_1\right)(K-1)\right)}
}{1 + \left(\eta - x_1/y_1\right)(K-1)}
\bigg)
\,dx_1\,dy_1
%
\\% 11
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta}
\bigg(
e^{-x_1-\frac{x_1}{\eta - x_1/y_1}(K-1)}
+
\frac{
e^{-\frac{x_1}{\eta - x_1/y_1} - x_1(K-1)}
}{1 + \left(\eta - x_1/y_1\right)(K-1)}
\bigg)
\,dx_1\,dy_1
%
\\% 12
&=
K
\int_{y_1=0}^\infty e^{-y_1}
\int_{x_1=0}^{y_1\eta}
e^{-\frac{x_1}{\eta - x_1/y_1}(K-1)}
\bigg(
e^{-x_1}
+
\frac{
e^{-x_1(K-1)}
}{1 + \left(\eta - x_1/y_1\right)(K-1)}
\bigg)
\,dx_1\,dy_1
\end{align*}

That's the point where I usually give up and take the numeric approach.
 
  • #14
1,367
61
I reached to almost a similar result. You divided the integral over ##y_2##, while I divided the integral over ##x_2##. It seems that numerical integration is the way to go.
 
  • #15
1,367
61
I tried to do the numerical integrals for your results. If we write your result as ##K(I_1(\eta)+I_2(\eta)))##, then the first integral ##I_1(\eta)## was evaluated OK, but the second ##I_2(\eta)## gives me a division by zero error. It gives me the same error when I divide the integral over ##x_2## instead of ##y_2##!!
 
  • #16
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
I tried to do the numerical integrals for your results. If we write your result as ##K(I_1(\eta)+I_2(\eta)))##, then the first integral ##I_1(\eta)## was evaluated OK, but the second ##I_2(\eta)## gives me a division by zero error. It gives me the same error when I divide the integral over ##x_2## instead of ##y_2##!!
I expect that's because the denominator ##(\eta - x_1/y_1)## of the exponent of the factor ##e^{-\frac{x_1}{\eta - x_1/y_1}(K-1)}## in the inner integrand goes to zero at the upper inner integration limit. The integral is an improper integral, but the integrand can be made continuous. You should be able to get around the problem by changing the integrand to be a branched function that gives zero if ##x_1 = y_1\eta## and what is written otherwise.
 
  • #17
1,367
61
You are right. When I accounted for the case ##\eta - x_1/y_1 = 0## I got the result without any warning or errors. I thought the software takes care of this, since ##e^{-\infty} = 0##! Apparently not. Thanks
 
  • #18
1,367
61
For the general case of M rather than 3, the depth of nesting is unknown, so it's easiest to use recursion.
I wouldn't normally do this but, since I'm learning Python, I wrote the following Python code for practice. It is untested but should do the trick if you enter the necessary PDFs for the X and Y distributions in the function skeletons below.

The integration is bog-basic rectangular integration. If you want you can sex it up to trapezoidal or even Simpson integration for a more accurate result. But unless it takes a long time to run, it's probably easier to keep it rectangular and just increase the granularity.

Code:
granularity = 1000
lim_y = 100
M = 10 # order statistic
K = 30 # number of Xs (also the number of Ys)

def pdf_y(y)
  # define pdf function for y here

def pdf_x(x)
  # define pdf function for x here

def cdf_x(x)
  # define cdf function for x here

def permut(n, k)
  # calc number of permutations of k chosen from n
  perm = n
  j = 1
  while j < k
    perm *= n - j
    j += 1
  return perm
# end of permutation function

def integ2(eta, x_prev, sum_x, m, k)
  y = 0
  integ_y = 0
  # integrate across this y_(M-m)
  while y < lim_y
    y += dy
    lim_x = max(x_prev, y * (eta - sum_x))
    dx = (limx - x_prev) / granularity
    x = x_prev
    integ_x = 0
    # integrate across this x_(M-m)
    while x < lim_x
       x += dx
       if m > 0
         integrand = integ2(eta, x, sum_x + x / y, m - 1, k - 1))
       else
         integrand = (1 - cdf_x(x)) ^ k
       integ_x += integrand * pdf_x(x) * dx
    integ_y += integ_x * pdf_y(y) * dy
  return integ_y
# end of definition of function integ2

dy = lim_y / granularity
# call the function to find the probability
prob = permut(K, M) * integ2(eta, 0, 0, M - 1, K - 1)
print(prob)
Have you run this code? Why do you pass M-1 and not M to the original function? Also why do you pass K-1 (it is not needed)? I am trying to implement the mathematical integration numerically in Python (I did it in a slightly different way), but it is not working. I am trying to figure out why. I wrote the recursive function mathematically as the following:

[tex]F_{m-1}(z_{m-2}) = \int_{y_{m-1}=0}^{\infty}\int_{x_{m-1}=x_{m-2}}^{\max(x_{m-2},\,y_{m-1}\,z_{m-2})}F_m\left(z_{m-2}-\frac{x_{m-1}}{y_{m-1}}\right)f_X(x_{m-1})f_Y(y_{m-1})\,dx_{m-1}\,dy_{m-1}[/tex]

where

[tex]F_{M}(z_{M-1}) = \int_{y_{M}=0}^{\infty}\int_{x_{M}=x_{M-1}}^{\max(x_{M-1},\,y_{M}\,z_{M-1})}\left[1-F_X(x_M)\right]^{K-M}f_X(x_{M})f_Y(y_{M})\,dx_{M}\,dy_{M}[/tex]

So, we need to find ##\frac{K!}{(K-M)!}F_1(z_0)##.

The code corresponding to this recursive function is

Python:
import math
K=3
M=2
dx = 0.01
dy = 0.01
y_lim = 100

#the CDF of X
def F(x):
    return math.exp(-x*(K-M))

#The pdf of Xs and Ys
def f(u):
    return math.exp(-u)

def G(z, m, x_prev_lim):
    if m == M:
        res = 0
        ux = x_prev_lim
        uy = 0
        while uy < y_lim:
            uy += dy
            resx = 0
            while ux <= max(x_prev_lim, uy*z):
                ux += dx
                resx += F(ux)*f(ux)*dx
            res += resx*f(uy)*dy
 
    else:
        res = 0
        ux = x_prev_lim
        uy = 0
        while uy < y_lim:
            uy += dy
            resx = 0
            while ux <= max(x_prev_lim, uy*z):
                ux += dx
                resx += G(z - (ux/uy), m+1, ux)*f(ux)*dx
            res += resx*f(uy)*dy
    return res

print((math.factorial(K)/math.factorial(K-M))*G(1, 1, 0))
Where did I make a mistake?
 
Last edited:
  • #19
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
Why do you pass M-1 and not M to the original function?
Because there are M nested integrals if the innermost integrand is a PDF, but in the above code the innermost integrand is a CDF, which is the integral of a PDF, so we use only M-1 nested integrals.

Here is that innermost integrand:
Code:
integrand = (1 - cdf_x(x)) ^ k
Also why do you pass K-1 (it is not needed)?
The K is used as the exponent in that innermost integrand.

I am away from home till Monday, and hence away from my Python compiler. But I hope the above is of some help.
 
  • #20
1,367
61
OK, that makes sense when the inner most double-integrals (##F_M(z_{M-1})## in the above equation) are evaluated numerically. Otherwise, the CDF of ##X## (which appears what you are referring to) is part of the joint PDF of the order statistics. And thus we still have ##M## nested double-integrals, right?

Regarding the argument ##k-1##, in the mathematical formulation, ##K## appears only once in evaluating ##F_M(z_{M-1})##, and it is constant. I wonder why you included it as an argument.
 
Last edited:
  • #21
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
Regarding the argument ##k-1##, in the mathematical formulation, ##K## appears only once in evaluating ##F_M(z_{M-1})##, and it is constant. I wonder why you included it as an argument.
When it is used, its value is K-M, because k starts at M-1 and is decremented by 1 in each recursive call, of which there are K-1.
Two alternative ways we could have done this are:
  • pass K-M as a parameter, all the way down through the recursion and use it as the exponent in the base case, or
  • not pass it as a parameter and instead use the global variables K and M in the base case. I try to avoid doing that, as it is considered bad programming practice (not 'modular') to use variables inside a function that were not passed as parameters. But I see I have already broken that principle by using the globals dy and lim_y
 
  • #22
1,367
61
OK, I see what you were doing. I defined M and K as global variables in my code, that is why I didn't need them as arguments. Otherwise, the code is clear, and I used it with some modifications based on my formulation. The logic seems fine, but the final result is way smaller than that of the corresponding Monte-Carlo simulations!! Is approximating the upper limit ##\infty## for the Y RVs with 100 is good enough? I went to 1000, but the running time was way too long.
 
  • #23
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
The way to make this tractable is to define a function ##H_k## that is the composition of the functions ##I^M_k## and ##J^M_k## that I defined in post 11, then the integral is the composition
$$\frac{K!}{(K-M)!} H_1 \circ H_2 \circ H_3 \circ ....... \circ H_M (0,\eta)$$

The function ##H_k## is given by:

$$H_k(x_{k-1},s_{k-1}) = \int_0^\infty f_Y(y_k) \int_{x_k=x_{k-1}}^{\max(x_{k-1},y_k s_{k-1})} f_X(x_k) H_{k+1} (x_k, s_{k-1}-x_k/y_k)\,dx_k\,dy_k$$
if ##k<M## and
$$H_k(x_{k-1},s_{k-1}) = \int_0^\infty f_Y(y_k) \int_{x_k=x_{k-1}}^{\max(x_{k-1},y_k s_{k-1})} f_X(x_k)
(1-F_X(x_M))^{K-M}
\,dx_k\,dy_k$$
otherwise.

Start by numerically evaluating values of ##H_M(x_{M-1},s_{M-1})## for a reasonably finely grained 2D table of ##x_{M-1},s_{M-1}##.

Then for each ##k## from ##M-1## down to 1, use the 2D table from the previous step to calculate a new 2D table of values for
##H_k(x_{k-1},s_{k-1})##.

For the last case of ##k=1## you need only calculate a 1D table, setting ##x_0=0##. The first parameter ##s_0## is the outermost integration limit ##\eta##.

Then multiply that last (1D) table of ##H_1(0,\eta)## by ##K!/(K-M)!## to get the vector of definite integrals that you seek.

You should be able to choose suitable granularity and integration limits (for ##y_k##) such that the calculation runs fast and accurately.

I recommend you use Simpson integration, as discussed previously. Where there is an even number of intervals to be integrated, use Simpson with an extra trapezium on the end.
 
  • #24
1,367
61
My formulation in post #18 is very similar to yours, with some change of variable names. I also go from the inner most integral to the outer most from m=1 up to M. I think this shouldn't change the final result.

Indeed the code runs very slow for the rectangular rule, even for M=2, especially when the step size is 0.001. I need to make it work first, and then I will optimize it. Creating the tables and the Simpson's rule should improve the running time, but I feel they will complicate the code, while it is not working.

How would you write a Python code for your formulation, without creating tables and using the rectangular rule? Just to keep things simple, and clear for now. Is your code I quoted in post #18 still valid?
 
  • #25
andrewkirk
Science Advisor
Homework Helper
Insights Author
Gold Member
3,792
1,390
Is approximating the upper limit ∞\infty for the Y RVs with 100 is good enough? I went to 1000, but the running time was way too long.
The upper integration limit has very little effect on the accuracy, because the density function gets so small beyond a limit of about 7. So big integration limits just ramp up runtime with negligible improvement in accuracy.
What has a huge effect on accuracy is the step size dy near 0, because y is a denominator and small changes in that make big changes in the ratio x/y.

I suggest trying it with:
  • upper limit of ##L=7## for both x and y.
  • dx constant at ##\delta=0.01##
  • dy scaled harmonically so that ##1/y - 1/(y+dy)<\delta##. For each y integration, you will calculate a vector of values that are the endpoints of the integration intervals. The intervals will get progressively wider as y increases. That way the finest granularity is applied only where it's needed, near 0.
  • For all but the outer integral over y, set the lower integration limit to be ##x_{k-1}/z_{k-1}##, since the x-integral inside the y-integral will be zero where y is below that.
  • For the outer integral over y, set the lower integration limit to ##\varepsilon=0.0001##
You could try varying the values of ##\delta, \varepsilon## and ##L## to get an optimal balance between runtime and accuracy.
Is your code I quoted in post #18 still valid?
Yes it's valid, but it will run either slowly or inaccurately, because of the issues identified above.
 

Related Threads on Order Statistics: CDF

  • Last Post
Replies
1
Views
6K
  • Last Post
Replies
5
Views
4K
  • Last Post
Replies
9
Views
3K
  • Last Post
Replies
4
Views
2K
  • Last Post
Replies
11
Views
3K
  • Last Post
Replies
8
Views
749
  • Last Post
Replies
2
Views
2K
Replies
2
Views
1K
  • Last Post
Replies
1
Views
1K
  • Last Post
Replies
4
Views
5K
Top