Extreme value theory and limiting distributions for i.i.d. order statistics

jamie_m · Jan 10, 2012

(This question was previously posted to sci.math.research. I only received one reply; sadly the advice therein conflicted with section 9.1 of H.A. David's "Order Statistics" - and probably with the fact that there was such a field of study as "r-extreme order statistics" - hence my reposting it here.)

I've been studying some cryptographic research in which the asymptotic normal distribution of the empirical sample quartile of order q is used to construct statistical models of the amount of data required for a successful cryptanalysis.

The main issue I have is that, while I'm pretty sure that such models have continued to be used for order statistics X_i (with i near to n) where the asymptotic normal distribution is inaccurate and where something based on extreme-value theory for the mth extremes would have been better, I don't have any idea as to how to compute an estimate for the value of i (or indeed q) above which the asymptotic normal might be considered suspect.

As an example, I'm currently dealing with the situation X_1 ≤ X_2 ...≤ X_n, where n = 2^{41}-1 = 2,199,023,255,551. In particular, I'm trying to work out whether the asymptotic normal is likely to be adequate when drawing conclusions about the top 2^{17} = 131,072 values or not - and while this seems a high m for m-th-extreme, it's not so high in relation to n, and this would mean I was dealing with the top 0.000006388% of values.

Can anyone give me some advice here?

Many thanks,

James McLaughlin.

bpet · Jan 14, 2012

There's a few options depending on how much is known about the distribution that your data is drawn from, e.g. you haven't said yet whether it's discrete or continuous, bounded, has a parametric form or known tail asymptotics.

Also it might be easier to consider the empirical cdf (which are scaled binomial variables) rather than the empirical quantiles, especially if the original problem can be reformulated in terms of exceedance properties.

If you have parametric or tail info then there's a few results in Embrechts' book that may be useful.

HTH

jamie_m · Jan 25, 2012

(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?

Number Nine · Jan 25, 2012

jamie_m said:

(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?

bpet is almost certainly referring to Embrechts' Modelling Extremal Events. You can find it here.

bpet · Jan 25, 2012

jamie_m said:

(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?

Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point. From there it should be possible to derive the normal approximation directly (may require an asymptotic expression for the incomplete gamma function) or even a higher-order approximation.

The empirical cdf is more useful for the situation where the data is from an unknown distribution. Also I think I meant to write "exceedance probabilities", not "exceedance properties".

HTH

jamie_m · Jan 26, 2012

"Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point." you state.

Let me see if I'm doing this right so far:

The CDF of the mth largest data point, where D is the number of data points, and where P(X ≤ x) denotes the CDF of the chi-squared distribution, is:

D
= Ʃ ((P(X ≤ x)^{i})(1-P(X ≤ x))^{D - i})(D \choose i)
i=(D - (m-1))

To obtain the PDF, I need to differentiate that using the chain rule and the product rule - this being made easier since I can express this PDF in terms of the PDF and CDF of the chi-square distribution.

Probably through numerical integration with Mathematica, I use that to obtain the mean and variance. (Hopefully (D \choose i) won't involve numbers too large for it to handle)

If there's no simpler method, and if that's what I need to do, I don't see how to obtain a Normal approximation for that. Is there somewhere I can use the Central Limit Theorem that I haven't spotted?

Stephen Tashi · Jan 27, 2012

jamie_m said:

I don't see how to obtain a Normal approximation for that.

I don't know anything about this problem, so my remarks are naive questions - not attempts to give some subtle hint about how to solve it.

How much of the asymptotic theory of order statistics is based on approximating the binomial distribution as a normal distribution?

If the cdf for one sample of random variable is F(x) and the pdf is f(x). We ask "In N samples, what is the probability that mth largest value is v?

One might hope that the probability can be computed as the two factors (probability that one sample is v)(probability that in remaining samples, m-1 are less than v and N-m are greater). The second factor could be viewed as a computation from a binomial distribution of N-1 samples, with probabiity F(v) of success.

Unlike the usual scenario for approximating the binomial by a normal, this approximation would involve approximating the density of the mth order statistic by a different normal distribution at each possible value for it. (It also involves a problematic event in the first factor, but often one can get away with misinterpreting f(v) that way.)

I have no trouble finding things on the web about the results of extreme value theory, but I haven't found any good explanations of how these results are derrived.

Extreme value theory and limiting distributions for i.i.d. order statistics

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Similar threads

Graduate Expected numbers of cards of a last color remaining

Undergrad The problem of points

Graduate Probability puzzle

Undergrad How does axiom of foundation prevent infinite sequence of elements?

Undergrad Understanding permutations and combinations in a coin toss experiment

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect