Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

I Order Statistics: CDF

  1. Jul 13, 2018 at 10:41 AM #1
    Suppose I have the random variables ##Z_k=X_k/Y_k## with a PDF ##f_{Z_k}(z_k)## for ##k=1,\,2\,\ldots, K##, where ##\{X_k, Y_k\}## are i.i.d. random variables. I can find

    [tex]\text{Pr}\left[\sum_{i=1}^3Z_k\leq \eta\right][/tex]

    as

    [tex]\int_{z_3=0}^{\eta}\int_{z_2=0}^{\eta-z_3}\int_{z_1=0}^{\eta-z_3-z_2}f_{Z_1}(z_1)f_{Z_2}(z_2)f_{Z_3}(z_3)\,dz_1dz_2dz_3[/tex]

    Now suppose I arrange the random variables ##\{X_k\}_{k=1}^K## as

    [tex]X_{(1)}\leq X_{(2)}\leq \cdots \leq X_{(K)}[/tex]

    and define ##G_k=X_{(k)}/Y_{i_k}##, where ##i_k## is the index of the ##k^{\text{th}}## order statistic.

    How can I find

    [tex]\text{Pr}\left[\sum_{i=1}^3G_k\leq \eta\right][/tex]
     
  2. jcsd
  3. Jul 13, 2018 at 8:54 PM #2

    andrewkirk

    User Avatar
    Science Advisor
    Homework Helper
    Gold Member

    This doesn't make it clear what is identically distributed to what, or what is independent of what.

    The maximal interpretation is that all ##X_k## and all ##Y_k## have the same distribution, with PDF ##f_X##, and that the set of random variables ##\{X_1,...,X_K,Y_1,...,Y_K\}## is pairwise independent. Or it could mean that ##X_k## has the same distribution as ##Y_k## and the two are independent of each other, but the distributions may differ from that of ##X_j## for ##j\neq k##, and there may be dependences between ##X_k## and ##X_j##. There are a variety of possible interpretations.

    I will assume that the set of all ##2K## random variables is pairwise independent, and that all ##X_k## have the same PDF ##f_X## and all the ##Y_k## have the same PDF ##f_Y##, but the two PDFs may differ from one another. I will also assume that the support of all the random variables is the positive reals, which is implied by the integral you have written.

    With those assumptions, the distribution of ##X_{(k)}/Y_{i_k}## is the same as that of ##X_{(k)}/Y_{j}## for any ##j##. Then the CDF of ##G_k## satisfies:

    $$ F_{G_k}(a) = \int_0^\infty\int_0^{\eta a}f_{X_{(k)}}(x) \,dx\,f_Y(y)\,dy$$

    The PDF ##f_{X_{(k)}}## of the ##k##th order statistic of ##X## can be calculated using the formula here.

    The PDF ##f_{G_k}## of ##G_k## is obtained by differentiating the above expression.
    Having obtained that, we calculate the CDF of ##G_1+G_2+G_3## using the triple integral you wrote above, but replacing each ##f_{Z_k}## by ##f_{G_k}##.
     
  4. Jul 13, 2018 at 10:25 PM #3
    What I meant by ##\{X_k, Y_k\}## of being i.i.d. is that every single random variable is independent of all other random variables, and all have the same PDF distribution. I could have written that the random variables ##\{X_k\}_{k=1}^{2K}## are i.i.d., but because I need to separate the nominator and denominator, I used different symbols. These random variables are exponential random variables, so yes, the support is positive real. I wasn't clear about that. My mistake.


    OK, so we can average over each ##G_k## as if the random variables are independent with no ordering, but what about the limits? I mean, ##G_3=\frac{X_{(3)}}{Y_3}## has the limits ##0## to ##\eta##, but ##G_2## contains ##X_{(2)}## which is less than or equal ##X_{(3)}## in ##G_3##. Or this doesn't affect the evaluation?
     
  5. Jul 14, 2018 at 12:33 AM #4

    andrewkirk

    User Avatar
    Science Advisor
    Homework Helper
    Gold Member

    You are right to point out that the new ##G_k##s are not mutually independent, because they are based on order statistics. So we have to change the integrand in your triple integral, because it is based on the Zs being independent.

    Instead we'll have to do (take a deep breath!):
    \begin{align*}
    \text{Pr}&\left[\sum_{i=1}^3G_k\leq \eta\right]
    =
    \text{Pr}\left[\frac{X_{(1)}}{Y_1}+\frac{X_{(2)}}{Y_2}+\frac{X_{(3)}}{Y_3}\leq \eta\right]
    \\
    &=
    \int_{y_1=0}^\infty \int_{x_1=0}^{y_1\eta}
    \int_{y_2=0}^\infty \int_{x_2=0}^{y_2(\eta - x_1/y_1)}
    \int_{y_3=0}^\infty \int_{x_3=0}^{y_3(\eta - x_1/y_1-x_2/y_2)}
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    f_Y(y_1) f_Y(y_2) f_Y(y_3)
    f_{X_{(1)},X_{(2)},X_{(3)}}(x_1,x_2,x_3)
    \,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
    \\
    &=
    \int_{y_1=0}^\infty f_Y(y_1) \int_{x_1=0}^{y_1\eta}
    \int_{y_2=0}^\infty f_Y(y_2) \int_{x_2=0}^{y_2(\eta - x_1/y_1)}
    \int_{y_3=0}^\infty f_Y(y_3) \int_{x_3=0}^{y_3(\eta - x_1/y_1-x_2/y_2)}
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    f_{X_{(1)},X_{(2)},X_{(3)}}(x_1,x_2,x_3)
    \,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
    \end{align*}

    It looks horrible, but it should simplify down if all your distributions are exponentials.

    You can get the three-way joint probability function for the Xs from the wiki section I linked above. The formula is on the last line in the section, with ##n=3##.

    EDIT: Actually the formulas in that section don't give exactly what we want, which is the joint density for the three smallest order stats, but we can infer a formula using the same principles. It should be:

    $$\frac{K!}{(K-3)!}f_X(x_1)f_X(x_2)f_X(x_3) \left[1-F_X(x_3)\right]^{K-3}$$

    where ##x_1\le x_2\le x_3##, otherwise zero. To enforce that inequality condition we need to change the limits for the X integrals, to get:
    \begin{align*}
    \text{Pr}&\left[\sum_{i=1}^3G_k\leq \eta\right]
    =
    \int_{y_1=0}^\infty \int_{x_1=0}^{y_1\eta}
    \int_{y_2=0}^\infty \int_{x_2=x_1}^{\max(x_1,y_2(\eta - x_1/y_1))}
    \int_{y_3=0}^\infty \int_{x_3=x_2}^{\max(x_2,y_3(\eta - x_1/y_1-x_2/y_2))}
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    \frac{K!}{(K-3)!}
    f_Y(y_1)f_Y(y_2)f_Y(y_3)
    f_X(x_1)f_X(x_2)f_X(x_3) \left[1-F_X(x_3)\right]^{K-3}
    \,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
    \\&=
    \frac{K!}{(K-3)!}
    \int_{y_1=0}^\infty f_Y(y_1) \int_{x_1=0}^{y_1\eta} f_X(x_1)
    \int_{y_2=0}^\infty f_Y(y_2) \int_{x_2=x_1}^{\max(x_1,y_2(\eta - x_1/y_1))} f_X(x_2)
    \int_{y_3=0}^\infty f_Y(y_3) \int_{x_3=x_2}^{\max(x_2,y_3(\eta - x_1/y_1-x_2/y_2))}
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    f_X(x_3) \left[1-F_X(x_3)\right]^{K-3}
    \,dx_3\,dy_3\,dx_2\,dy_2\,dx_1\,dy_1
    \end{align*}
     
    Last edited: Jul 14, 2018 at 1:51 AM
  6. Jul 14, 2018 at 1:17 AM #5
    Thanks. As expected. The result is nasty. Could you explain how you arrived to the integrals ( and their limits) in a little more detail?

    Another but related question, can this CDF be evaluated numerically using any software tool for any value ##1\leq M\leq K## instead of 3?
     
  7. Jul 14, 2018 at 2:01 AM #6

    andrewkirk

    User Avatar
    Science Advisor
    Homework Helper
    Gold Member

    If all the distributions are exponentials, I think there's a reasonable chance you could evaluate this analytically, starting from the inside and working out. It's possible the max-es in the upper limits may throw a spanner in the works, but it's worth a try. If not then a good mathematical software tool should be able to numerically integrate it. If it complains about the six-deep nesting, it shouldn't be too hard to write a little program to do it, in something like Matlab, Python or R, assuming the exponential parameter and the value of ##\eta## are both known.

    If we change 3 to some other M in 1...K, the same principles apply, but instead of six-deep nesting of integrals it will be 2M-deep! Maybe there's some formula that can reduce the number of integrals, but I can't see one from here.

    Setting ##x_0=0## to make the pattern of limits clearer, the formula would be:

    \begin{align*}
    \text{Pr}&\left[\sum_{i=1}^M G_k\leq \eta\right]
    =
    \int_{y_1=0}^\infty \int_{x_1=x_0}^{\max\left(x_0,\,y_1\left(\eta - \sum_{k=1}^0 x_k/y_k\right)\right)}
    \int_{y_2=0}^\infty \int_{x_2=x_1}^{\max\left(x_1,\,y_2\left(\eta - \sum_{k=1}^1 x_k/y_k\right)\right)}
    \\
    &\quad\quad\quad\quad\quad\quad\quad\quad\quad...
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad
    \int_{y_M=0}^\infty \int_{x_M=x_{M-1}}^{\max\left(x_{M-1},\,y_M\left(\eta - \sum_{k=1}^{M-1} x_k/y_k\right)\right)}
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    \frac{K!}{(K-M)!}
    f_Y(y_1)f_Y(y_2)...f_Y(y_M)
    f_X(x_1)f_X(x_2)...f_X(x_M) \left[1-F_X(x_M)\right]^{K-M}
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    \,dx_M\,dy_M...\,dx_2\,dy_2\,dx_1\,dy_1
    %
    \\&=
    \frac{K!}{(K-M)!}
    \int_{y_1=0}^\infty f_Y(y_1)
    \int_{x_1=0}^{\max\left(x_0,\,y_1\left(\eta - \sum_{k=1}^0 x_k/y_k\right)\right)} f_X(x_1)
    \int_{y_2=0}^\infty f_Y(y_2)
    \int_{x_2=x_1}^{\max\left(x_1,\,y_2\left(\eta - \sum_{k=1}^1 x_k/y_k\right)\right)} f_X(x_2)
    \\
    &\quad\quad\quad\quad\quad\quad\quad\quad\quad...
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad
    \int_{y_M=0}^\infty f_Y(y_M)
    \int_{x_M=x_{M-1}}^{\max\left(x_{M-1},\,y_M\left(\eta - \sum_{k=1}^{M-1} x_k/y_k\right)\right)} f_X(x_M)
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    \left[1-F_X(x_M)\right]^{K-M}
    \\&\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad
    \,dx_M\,dy_M...\,dx_2\,dy_2\,dx_1\,dy_1
    \end{align*}

    The complexity of the integral is nearly all in the limits. In the centre is just the joint PDF of all six variables. That is the product of three PDFs for the three independent Y random variable and the joint PDF for the three Xs. The latter is converted to a product of three independent X PDFs, with a combinatoric coefficient because of the different possible orderings, and the factor ##\left[1-F_X(x_M)\right]^{K-M}## that ensures that all the other ##M-K## X values are greater than ##X_{(M)}##.

    The integration limits are set up from the outside in. The range of the Y variables is unconstrained. But each X variable must have an upper limit to ensure that it doesn't cause the sum of all the Gs covered by the integral signs thus far to exceed ##\eta##. Also, each X has to have a lower limit of the previous X value, to protect the required ordering of the order statistics, and that requires an adjustment to the upper limit as well, so that if the lower limit is greater than the upper limit, we get zero rather than a negative number from a backwards integral, which is a possible interpretation when the lower integral limit exceeds the upper one.
     
  8. Jul 14, 2018 at 8:56 AM #7
    That makes sense. The complexity of the mathematical expression is not an issue. Eventually, I need to evaluate the CDF. It doesn't matter how (except not using Monte-Carlo simulation, because I use it for verification). Of course, if a numerical solution is possible, that is great, because it would be easier to evaluate, but numerical integration is just fine. The analytical solution doesn't have a pattern in the case of no ordering, so I suspect it would have with ordering. I will try this again anyway.

    Could you give me a hint on how to write a program in either MATLAB or Python to evaluate these ##2M## nested integrals? In MATLAB I think you cannot go for more than 2 or maybe three nested integrals. I need a code that can be used for any value of M.
     
  9. Jul 14, 2018 at 11:11 PM #8

    andrewkirk

    User Avatar
    Science Advisor
    Homework Helper
    Gold Member

    For the general case of M rather than 3, the depth of nesting is unknown, so it's easiest to use recursion.
    I wouldn't normally do this but, since I'm learning Python, I wrote the following Python code for practice. It is untested but should do the trick if you enter the necessary PDFs for the X and Y distributions in the function skeletons below.

    The integration is bog-basic rectangular integration. If you want you can sex it up to trapezoidal or even Simpson integration for a more accurate result. But unless it takes a long time to run, it's probably easier to keep it rectangular and just increase the granularity.

    Code (Text):
    granularity = 1000
    lim_y = 100
    M = 10 # order statistic
    K = 30 # number of Xs (also the number of Ys)

    def pdf_y(y)
      # define pdf function for y here

    def pdf_x(x)
      # define pdf function for x here

    def cdf_x(x)
      # define cdf function for x here

    def permut(n, k)
      # calc number of permutations of k chosen from n
      perm = n
      j = 1
      while j < k  
        perm *= n - j
        j += 1
      return perm
    # end of permutation function

    def integ2(eta, x_prev, sum_x, m, k)
      y = 0
      integ_y = 0
      # integrate across this y_(M-m)
      while y < lim_y
        y += dy
        lim_x = max(x_prev, y * (eta - sum_x))
        dx = (limx - x_prev) / granularity
        x = x_prev
        integ_x = 0
        # integrate across this x_(M-m)
        while x < lim_x
           x += dx
           if m > 0
             integrand = integ2(eta, x, sum_x + x / y, m - 1, k - 1))
           else
             integrand = (1 - cdf_x(x)) ^ k
           integ_x += integrand * pdf_x(x) * dx
        integ_y += integ_x * pdf_y(y) * dy
      return integ_y
    # end of definition of function integ2

    dy = lim_y / granularity
    # call the function to find the probability
    prob = permut(K, M) * integ2(eta, 0, 0, M - 1, K - 1)
    print(prob)
     
     
  10. Jul 15, 2018 at 12:01 AM #9
    I will need to go over the math and the code in more details, and compare the results with those of Monte-Carlo simulations at the end. I will report back. Thanks for your responses. They were very helpful.
     
Share this great discussion with others via Reddit, Google+, Twitter, or Facebook

Have something to add?
Draft saved Draft deleted